
Marius Mutu ab160b628d feat(ocr): Add validation system and CLIENT CUI extraction

OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-30 19:12:52 +02:00


Feature Specification: OCR Data Extraction Validation System

Feature ID: bon-ocr-validation
Priority: Critical (P0 - Production Bug)
Complexity: High
Estimated Effort: 2-3 days
Created: 2025-12-30
Module: Data Entry (backend/modules/data_entry/)

Overview

Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.

Value Proposition: Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.

Problem Statement

Current Behavior (BROKEN)

The OCR processing pipeline (backend/modules/data_entry/services/ocr_service.py) uses a 3-step adaptive approach:

Step 1: PaddleOCR + Light preprocessing (fast, high confidence)
Step 2: PaddleOCR + Heavy preprocessing (faded receipts)
Step 3: Tesseract (complement missing fields only)

Critical Bug: The _merge_extractions() method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.

Real Production Example (Five-Holding Receipt)

Field	Light OCR (98%) ✅	Heavy OCR (88%) ❌	Final Result ❌
TOTAL	85.99 LEI	859,762.16 LEI	859,762.16 (10,000x error!)
TVA	14.92 LEI	149,214.92 LEI	149,214.92 (10,000x error!)
CUI	R010562600	(not found)	R010562600
Date	2025-10-11	2025-10-11	2025-10-11
Confidence	98%	88%	88% (wrong source!)

Root Cause: Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in image_preprocessor.py) merge adjacent numbers, creating garbage values.

Impact

Data Integrity: Incorrect amounts enter accounting system
User Trust: Users lose confidence in OCR accuracy
Manual Work: Requires manual verification of ALL OCR extractions
Financial Risk: Wrong amounts could be approved without review

User Stories

1. As a user uploading a clear PDF receipt

I want OCR to extract correct values from the first pass
So that I don't have to manually correct obvious errors

Acceptance Criteria:

Light OCR correctly extracts 85.99 LEI (not 859,762.16)
Heavy OCR is skipped when Light OCR confidence >= 90%
No 10,000x magnitude errors
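
The second criterion above can be sketched as a small gating function. This is only an illustration: the function name and the handling of a missing confidence value are assumptions, and only the 90% threshold comes from the criterion itself.

```python
from typing import Optional

# Threshold taken from the acceptance criterion above (Light OCR confidence >= 90%).
LIGHT_CONFIDENCE_THRESHOLD = 0.90


def needs_second_pass(light_confidence: Optional[float]) -> bool:
    """Return True when the second preprocessing pass should run.

    Hypothetical helper: a missing confidence is treated as "not confident",
    which is an assumption, not something the spec states.
    """
    if light_confidence is None:
        return True
    return light_confidence < LIGHT_CONFIDENCE_THRESHOLD


# The production receipt had 98% Light confidence, so the second pass is skipped.
print(needs_second_pass(0.98))
```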

2. As a user submitting a receipt with warnings

I want to be able to save receipts with validation warnings
So that I can submit for review even if OCR isn't perfect

Acceptance Criteria:

Save button works with warnings (not blocked)
Receipt marked with needs_manual_review=True
Warnings displayed clearly in UI

3. As a supervisor reviewing receipts

I want to see which receipts need manual review
So that I can prioritize validation efforts

Acceptance Criteria:

Filter by "Needs Review" flag
Validation warnings shown in detail view
Clear indication of which fields are suspicious

4. As a system validating cross-field data

I want to validate CARD + NUMERAR = TOTAL
So that payment methods match the total amount

Acceptance Criteria:

Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
If mismatch, flag for review
Auto-correct TOTAL from payment sum if confidence < 80%

5. As a system validating TVA entries

I want to validate Σ(TVA entries) = TVA TOTAL
So that individual TVA lines match the total TVA

Acceptance Criteria:

Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
TVA rate validation (5-24% of TOTAL)
If mismatch, flag for review

Functional Requirements

Core Requirements (Must-Have)

1. Multi-Layer Validation Pipeline

FR-1.1: Absolute value sanity checks

Amount range: 0.01 - 100,000 RON
Max 2 decimal places
Date: not in future, not older than 10 years (2015+)
CUI: 6-10 digits, valid Mod 11 checksum
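
The date check in FR-1.1 could look roughly like the sketch below. The function name, the returned messages, and the injectable `today` argument are assumptions made so the example can be exercised deterministically. The 10-year window is taken from the requirement.

```python
from datetime import date
from typing import Optional

MAX_AGE_YEARS = 10  # from FR-1.1: not older than 10 years


def date_warning(receipt_date: date, today: Optional[date] = None) -> Optional[str]:
    """Return a warning message for an implausible receipt date, or None if it is fine."""
    today = today or date.today()
    if receipt_date > today:
        return "date is in the future"
    try:
        oldest = today.replace(year=today.year - MAX_AGE_YEARS)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        oldest = today.replace(year=today.year - MAX_AGE_YEARS, day=28)
    if receipt_date < oldest:
        return f"date is older than {MAX_AGE_YEARS} years"
    return None


# The production receipt date, checked against the spec's creation date.
print(date_warning(date(2025, 10, 11), today=date(2025, 12, 30)))
```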

FR-1.2: Cross-field correlation validation

TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
Inter-OCR consistency: flag if values differ >10x between engines
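
A minimal sketch of the inter-OCR consistency check, using the amounts from the production example. The helper names are hypothetical. Guarding against missing or non-positive values is an assumption added here so the division can never raise ZeroDivisionError.

```python
from decimal import Decimal
from typing import Optional

INCONSISTENCY_FACTOR = Decimal("10")  # from FR-1.2: flag if values differ >10x


def inter_ocr_ratio(a: Optional[Decimal], b: Optional[Decimal]) -> Optional[Decimal]:
    """Ratio between the larger and the smaller value, or None if not comparable."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None
    return max(a, b) / min(a, b)


def is_inconsistent(a: Optional[Decimal], b: Optional[Decimal]) -> bool:
    """True when both values are present and differ by more than the factor."""
    ratio = inter_ocr_ratio(a, b)
    return ratio is not None and ratio > INCONSISTENCY_FACTOR


light_total = Decimal("85.99")
heavy_total = Decimal("859762.16")
print(is_inconsistent(light_total, heavy_total))
```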

FR-1.3: Auto-correction logic

If TOTAL is obviously wrong (>10x payment sum), use payment sum
If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
Preserve high-confidence values from Light OCR over low-confidence Heavy OCR
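
The first two auto-correction rules can be illustrated as follows. This is a sketch under assumptions: the function names are invented, and the "reverse formula" is taken to mean TOTAL = TVA × (100 + rate) / rate for a single known TVA rate, with TOTAL being VAT-inclusive. The spec does not spell the formula out.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def corrected_total(total: Decimal, payment_sum: Decimal) -> Decimal:
    """Use the payment sum when TOTAL is more than 10x larger than it."""
    if payment_sum > 0 and total > 10 * payment_sum:
        return payment_sum
    return total


def total_from_tva(tva: Decimal, rate_percent: Decimal) -> Decimal:
    """Recompute a VAT-inclusive TOTAL from the TVA amount (assumed single rate)."""
    total = tva * (Decimal(100) + rate_percent) / rate_percent
    return total.quantize(CENT, rounding=ROUND_HALF_UP)


# Production example: the garbage TOTAL is replaced by the payment sum.
print(corrected_total(Decimal("859762.16"), Decimal("85.99")))
```

Mixed-rate receipts would need the per-rate TVA entries rather than a single rate, so the second helper only covers the simplest case.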

FR-1.4: Validation result structure

@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor

2. Replace Heavy with Medium OCR

FR-2.1: Remove preprocess_heavy() from the OCR pipeline (the method itself stays, marked deprecated, for backward compatibility)

Current Heavy: aggressive binarization causes digit concatenation
Reason: Destroys high-quality PDFs while trying to recover faded receipts

FR-2.2: Add preprocess_medium() method

Moderate contrast enhancement (CLAHE clipLimit=2.0)
Light denoising (fastNlMeansDenoising h=6)
NO binarization, NO morphological operations
Preserve text boundaries on clear images

FR-2.3: Update OCR pipeline

Step 1: Light preprocessing (unchanged)
Step 2: Medium preprocessing (replaces Heavy)
Step 3: Tesseract (unchanged)

3. Enhanced CUI Extraction

FR-3.1: Romanian CIF validation algorithm

Implement Mod 11 checksum validation
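Control digit formula: (sum(digit[i] * weight[i]) * 10) % 11, with a result of 10 mapped to 0
Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], aligned with the rightmost digit before the control digit (shorter CUIs are left-padded with zeros)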

FR-3.2: CUI format normalization

Always add "RO" prefix if missing
Remove spaces, dashes, dots
Validate length: 6-10 digits

FR-3.3: Improved regex patterns

# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',  # Optional RO/R0 prefix, spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',   # C. I. F. (spaced)
]
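
As a rough usage sketch, the patterns can be tried in order against a line of OCR text. The helper name, the case-insensitive flag, and the sample strings are assumptions. The first pattern is written here with the RO/R0 prefix as an optional group so that CUIs without a prefix also match.

```python
import re
from typing import Optional

# Patterns adapted from the list above; treat them as illustrative.
PATTERNS = [
    r'CIF[:\s]*(?:R[O0])?\s*(\d[\d\s]{5,9})',  # optional RO/R0 prefix, spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',            # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',       # C. I. F. (spaced)
]


def extract_cui_digits(text: str) -> Optional[str]:
    """Return the first 6-10 digit CUI candidate found, with whitespace removed."""
    for pattern in PATTERNS:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            digits = re.sub(r'\s', '', match.group(1))
            if 6 <= len(digits) <= 10:
                return digits
    return None


for sample in ("CIF: RO 1056 2600", "C1F 10562600", "C.I.F.: 10562600"):
    print(sample, "->", extract_cui_digits(sample))
```

Because the capture groups accept whitespace, a candidate can run on into digits on the following line, so the checksum validation is still needed afterwards.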

4. User Requirements Integration

FR-4.1: Non-blocking validation warnings

Save button enabled even with warnings
User can override and submit
Warnings displayed clearly in UI

FR-4.2: Manual review flag

Database field: receipts.needs_manual_review (BOOLEAN)
Set to TRUE if:
- Any validation warning present
- Overall confidence < 85%
- Cross-validation fails

FR-4.3: Apply to new uploads only

No reprocessing of existing receipts
Validation runs on OCR extraction (POST /api/ocr/extract)
Migration: add column with default NULL (not FALSE)

Secondary Requirements (Nice-to-Have)

FR-S1: Validation confidence scoring

Each validation rule contributes to score
Overall validation confidence: weighted average
Display in UI alongside OCR confidence
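
One possible reading of FR-S1 is a weighted share of passed rules. The sketch below assumes that interpretation. The rule weights and the default weight of 1.0 are invented for illustration and do not come from the spec.

```python
from typing import Dict


def validation_confidence(rule_passed: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Weighted average of rule outcomes: 1.0 if every rule passed, 0.0 if none did."""
    total_weight = sum(weights.get(rule, 1.0) for rule in rule_passed)
    if total_weight == 0:
        return 0.0
    passed_weight = sum(weights.get(rule, 1.0) for rule, ok in rule_passed.items() if ok)
    return passed_weight / total_weight


# Hypothetical weights: an out-of-range amount counts twice as much as the other rules.
weights = {"amount_range": 2.0, "tva_ratio": 1.0, "payment_sum": 1.0}
outcome = {"amount_range": False, "tva_ratio": True, "payment_sum": True}
print(validation_confidence(outcome, weights))
```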

FR-S2: Validation rule configurability

Move hardcoded thresholds to config
Allow per-company customization
Admin UI to adjust rules

Technical Requirements

Files to Create

1. `backend/modules/data_entry/services/ocr/validation.py`

Purpose: Validation utilities and rule engine
Size: ~400 lines
Key Classes:

ValidationRule (base class)
AmountRangeRule, TVARatioRule, PaymentSumRule, CUIChecksumRule
OCRValidationEngine (orchestrator)

Example:

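from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

# ExtractionResult is defined in services/ocr_extractor.py
from backend.modules.data_entry.services.ocr_extractor import ExtractionResult

@dataclass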
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None

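class ValidationRule(ABC):
    """Base validation rule."""
    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return corrected field values; rules without auto-correction return {}."""
        return {}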

class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # Decimal('0.00') is falsy but must still be checked
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings

class OCRValidationEngine:
    """Orchestrate all validation rules."""
    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}

        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)

            # Apply auto-corrections
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )

        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )

2. `backend/modules/data_entry/tests/test_ocr_validation.py`

Purpose: Unit tests for validation rules
Size: ~300 lines
Coverage Target: >90%

Test Cases:

test_amount_range_valid() - 85.99 RON passes
test_amount_range_too_high() - 859,762.16 fails
test_tva_ratio_valid() - 14.92/85.99 ≈ 17.4% passes
test_tva_ratio_too_high() - 149,214.92/859,762.16 ≈ 17.4%, so the ratio passes even though both amounts are wrong (caught by amount_range)
test_payment_sum_matches() - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
test_cui_checksum_valid() - R010562600 passes Mod 11
test_cui_checksum_invalid() - R010562601 fails Mod 11
test_inter_ocr_consistency() - 85.99 vs 859,762.16 = 10,000x flag

3. `backend/modules/data_entry/tests/test_ocr_validation_integration.py`

Purpose: Integration tests with full OCR pipeline
Size: ~200 lines

Test Cases:

test_five_holding_receipt() - Real production case (85.99 not 859,762.16)
test_clear_pdf_uses_light_ocr() - High-quality PDF skips the second pass (Medium, formerly Heavy)
test_faded_receipt_uses_medium_ocr() - Thermal receipt uses Medium
test_validation_warnings_in_response() - API returns warnings
test_manual_review_flag_set() - Flag set when confidence < 85%

Files to Modify

1. `backend/modules/data_entry/services/ocr_service.py`

Changes: ~200 lines modified, ~100 lines added

Key Modifications:

A. Replace _merge_extractions() (lines 240-386) with validation-aware version:

def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    validator = OCRValidationEngine()

    # Validate both extractions
    light_validation = validator.validate(light) if light else None
    medium_validation = validator.validate(medium) if medium else None

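    # With only one extraction (or none) there is nothing to merge
    if light is None:
        return medium if medium is not None else ExtractionResult()
    if medium is None:
        return light

    result = ExtractionResult()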

    # === AMOUNT (with validation check) ===
    if light.amount and medium.amount:
        # Check for 10x inconsistency
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer value that passes validation
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']

            # On a tie, keep Light (FR-1.3: preserve Light OCR values)
            if len(light_warnings) <= len(medium_warnings):
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
                print(f"[Merge] Using Light OCR amount: {light.amount} (no more warnings than Medium)", flush=True)
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
                print(f"[Merge] Using Medium OCR amount: {medium.amount} (fewer warnings)", flush=True)
        else:
            # Normal merge: prefer higher confidence
            if light.confidence_amount >= medium.confidence_amount:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)

    return result

B. Add preprocess_medium() call (replace Heavy):

# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW

try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing

C. Add validation to final result:

# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review

2. `backend/modules/data_entry/services/ocr_extractor.py`

Changes: ~50 lines modified, ~30 lines added

Key Modifications:

A. Add validation fields to ExtractionResult (lines 10-50):

@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[Any] = field(default_factory=list)  # ValidationWarning objects (the router reads w.field, w.rule, ...)
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'

B. Fix CLIENT CUI patterns (lines 253-272):

# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...

    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                      # CUI on next line
]

C. Add CUI normalization and validation:

def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

    # Remove non-digits
    digits = re.sub(r'\D', '', cui)

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False

    # Weights: 7, 5, 3, 2, 1, 7, 5, 3, 2 (right-to-left)
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Calculate checksum (all digits except last)
    digits_to_check = digits[:-1].zfill(9)  # Pad with zeros if needed
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10, then Mod 11; a result of 10 maps to control digit 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder

    return control == expected_control

3. `backend/modules/data_entry/services/image_preprocessor.py`

Changes: ~80 lines added

Key Modifications:

A. Add preprocess_medium() method (after line 166):

def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.
    Balance between Light (too gentle) and Heavy (too aggressive).

    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape

    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            scale = 4000 / max(height, width)  # cap relative to the current size, because resize() scales `gray`
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened

B. Mark preprocess_heavy() as deprecated:

def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)

4. `backend/modules/data_entry/routers/ocr.py`

Changes: ~40 lines modified

Key Modifications:

A. Update ExtractionData schema instantiation (lines 106-128):

# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []

data = ExtractionData(
    # ... existing fields ...

    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)

5. `backend/modules/data_entry/schemas/ocr.py`

Changes: ~20 lines added

Key Modifications:

A. Add validation fields to ExtractionData (after line 57):

class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")

class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default=[], description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")

6. Database Migration: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`

Purpose: Add needs_manual_review column to receipts table
Size: ~30 lines (Alembic migration)

"""Add needs_manual_review flag to receipts

Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None

def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))

def downgrade():
    op.drop_column('receipts', 'needs_manual_review')

Frontend Integration Points

1. `src/modules/data-entry/views/receipts/ReceiptCreateView.vue`

Changes: Display validation warnings below OCR results

Example:

<template>
  <div class="ocr-results">
    <!-- Existing OCR fields -->

    <!-- NEW: Validation warnings section -->
    <div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
      <h4>
        <i class="pi pi-exclamation-triangle" />
        Avertismente Validare ({{ ocrData.validation_warnings.length }})
      </h4>
      <ul>
        <li
          v-for="(warning, idx) in ocrData.validation_warnings"
          :key="idx"
          :class="`severity-${warning.severity}`"
        >
          <strong>{{ warning.field }}:</strong> {{ warning.message }}
          <span v-if="warning.suggested_value" class="suggestion">
            (sugestie: {{ warning.suggested_value }})
          </span>
        </li>
      </ul>
    </div>

    <!-- NEW: Manual review badge -->
    <div v-if="ocrData.needs_manual_review" class="manual-review-badge">
      <i class="pi pi-flag" />
      Necesită verificare manuală
    </div>
  </div>
</template>

<style scoped>
.validation-warnings {
  margin-top: 1rem;
  padding: 1rem;
  background: #fff3cd;
  border-left: 4px solid #ffc107;
}

.validation-warnings li.severity-low {
  color: #666;
}

.validation-warnings li.severity-medium {
  color: #f57c00;
}

.validation-warnings li.severity-high {
  color: #d32f2f;
  font-weight: bold;
}

.manual-review-badge {
  margin-top: 0.5rem;
  padding: 0.5rem 1rem;
  background: #fff3cd;
  border-radius: 4px;
  display: inline-flex;
  align-items: center;
  gap: 0.5rem;
}
</style>

2. `src/modules/data-entry/components/ocr/OCRPreview.vue`

Changes: Add inter-OCR consistency indicator

Example:

<template>
  <div class="ocr-preview">
    <!-- Existing fields -->

    <!-- NEW: Inter-OCR consistency warning -->
    <div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
      <i class="pi pi-exclamation-circle" />
      Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
      <br />
      <small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
    </div>
  </div>
</template>

Design Decisions

1. Why Validation Warnings Instead of Errors?

Decision: Use non-blocking warnings instead of blocking errors.

Rationale:

User requirement: "Allow save with warnings"
OCR will never be 100% perfect
Users can override incorrect extractions
Supervisor review catches issues before approval

Trade-off: Risk of bad data entering system vs. user frustration with blocked submissions.

Mitigation: Manual review flag ensures supervisor catches issues.

2. Why Replace Heavy with Medium OCR?

Decision: Remove Heavy preprocessing, add Medium preprocessing.

Rationale:

Heavy causes digit concatenation on clear PDFs (production evidence)
Binarization destroys text boundaries on high-quality images
Morphological operations merge adjacent numbers (85.99 → 859,762.16)

Analysis of Heavy Preprocessing (lines 153-164 in image_preprocessor.py):

# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers

Alternative Considered: Keep Heavy but add safeguards. Rejected: Too risky, no benefit for clear PDFs.

3. Why Romanian CIF Mod 11 Validation?

Decision: Implement CIF checksum validation algorithm.

Rationale:

Romanian CIFs have built-in checksum (last digit)
Validates extracted CUI is mathematically correct
Catches OCR digit errors (10562600 vs 10562601)

Algorithm: Mod 11 checksum

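Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], aligned with the rightmost digit before the control digit (shorter CUIs are left-padded with zeros)
Formula: (sum(digit[i] * weight[i]) * 10) % 11
Control digit: the result (0 if the result is 10)

Example: RO10562600

Digits: 1,0,5,6,2,6,0 plus control digit [0]; left-padded to nine digits: 0,0,1,0,5,6,2,6,0
Checksum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
(78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → VALID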

Note: Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).

4. Why Apply to New Uploads Only?

Decision: Don't reprocess existing receipts.

Rationale:

Migration impact: ~500 existing receipts in DB
Reprocessing cost: OCR is slow (~2-5s per receipt)
Risk: May change existing approved data
Benefit: Minimal (old receipts already reviewed)

Implementation: Migration adds column with default NULL (not FALSE).
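
Because the column is nullable, code that reads it has three states to distinguish, and a plain truthiness test would lump NULL together with FALSE. A small sketch of the mapping follows. The function name and the status labels are hypothetical.

```python
from typing import Optional


def review_status(needs_manual_review: Optional[bool]) -> str:
    """Map the nullable column to one of three states."""
    if needs_manual_review is None:
        return "not_validated"  # receipts created before the migration
    return "needs_review" if needs_manual_review else "validated_ok"


for value in (None, False, True):
    print(value, "->", review_status(value))
```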

Validation Rules Specification

1. Amount Range Validation

Rule: Amount must be between 0.01 and 100,000 RON.

Implementation:

class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # Decimal('0.00') is falsy but must still trigger the "too small" warning
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))

            # Check decimal places (a positive exponent, e.g. Decimal('1E+2'), means none)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings

Test Cases:

0.00 RON → Warning (too small)
0.01 RON → Valid
85.99 RON → Valid
100,000 RON → Valid
100,001 RON → Warning (too large)
859,762.16 RON → Warning (too large)
85.999 RON → Warning (too many decimals)

2. TVA Ratio Validation

Rule: TVA must be 5-24% of TOTAL amount.

Implementation:

class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings

Test Cases:

TVA=14.92, TOTAL=85.99 → 17.4% → Valid
TVA=149,214.92, TOTAL=859,762.16 → 17.4% → Ratio passes, but both values are wrong (caught by amount_range)
TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)
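
To see where the listed TVA rates land inside the 5-24% window, the share of TVA in a VAT-inclusive total can be computed as rate / (100 + rate). The sketch below assumes TOTAL includes TVA and that a single rate applies to the whole receipt. The function name is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

RATES = [5, 9, 11, 19, 21]  # rates listed in FR-1.2


def tva_share_of_total(rate_percent: int) -> Decimal:
    """TVA as a percentage of the VAT-inclusive TOTAL, rounded to 2 decimals."""
    rate = Decimal(rate_percent)
    share = rate / (Decimal(100) + rate) * Decimal(100)
    return share.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for rate in RATES:
    print(f"{rate}% rate -> {tva_share_of_total(rate)}% of TOTAL")
```

Under these assumptions the 21% rate gives about 17.36% of TOTAL, in line with the production receipt (14.92 / 85.99). The 5% rate gives about 4.76%, slightly below the 5% lower bound, so a receipt taxed only at 5% would raise tva_ratio_low. That may be worth checking before the threshold is fixed.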

3. Payment Sum Validation

Rule: CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).

Implementation:

class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
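            # Coerce each amount and start from Decimal('0') so float inputs cannot raise TypeError
            payment_sum = sum((Decimal(str(pm['amount'])) for pm in extraction.payment_methods), Decimal('0'))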
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections

Test Cases:

CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning
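
The boundary case above (Diff=0.02) depends on exact decimal arithmetic. A minimal standalone sketch of the comparison, assuming amounts arrive as strings or Decimals rather than floats; the function name `payment_sum_difference` is hypothetical:

```python
from decimal import Decimal
from typing import Iterable

TOLERANCE = Decimal('0.02')  # ±0.02 RON, as stated in the rule

def payment_sum_difference(payments: Iterable[str], total: str) -> Decimal:
    """Absolute difference between the sum of payment amounts and TOTAL."""
    payment_sum = sum((Decimal(p) for p in payments), Decimal('0'))
    return abs(payment_sum - Decimal(total))

# The three test cases listed above
for payments, total in [(['50', '35.99'], '85.99'),
                        (['50', '35.97'], '85.99'),
                        (['50', '35.00'], '85.99')]:
    diff = payment_sum_difference(payments, total)
    status = 'Warning' if diff > TOLERANCE else 'Valid'
    print(f"{payments} vs {total}: diff={diff} -> {status}")
```

With binary floats the same subtraction is generally not exactly 0.02, so the boundary case could land on either side of the tolerance. Keeping every amount as a Decimal avoids that.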

4. TVA Entries Sum Validation

Rule: Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).

Implementation:

class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections

Test Cases:

Entries=[A:19%:14.92], TOTAL=14.92 → Valid
Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning

5. Inter-OCR Consistency Validation

Rule: Flag if the same field differs by more than 10x between OCR passes (e.g. Light vs. Medium).

Implementation:

class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """Check the ratio that the merge step computed and stored on the extraction."""
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings

Test Cases:

Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!
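
The rule reads `inter_ocr_ratio` but this section does not show how the ratio is computed. One plausible definition, offered as an assumption rather than the service's actual merge logic, is the larger amount divided by the smaller one. The helper `compute_inter_ocr_ratio` is hypothetical.

```python
from decimal import Decimal
from typing import Optional

def compute_inter_ocr_ratio(first: Optional[Decimal],
                            second: Optional[Decimal]) -> Optional[Decimal]:
    """Return max/min of two positive amounts, or None if either is missing or not positive."""
    if first is None or second is None or first <= 0 or second <= 0:
        return None
    larger, smaller = max(first, second), min(first, second)
    return larger / smaller

# The four test cases listed above
light = Decimal('85.99')
for medium in (Decimal('86.00'), Decimal('90.00'),
               Decimal('859.76'), Decimal('859762.16')):
    ratio = compute_inter_ocr_ratio(light, medium)
    print(f"Light={light}, Medium={medium} -> ratio={ratio:.2f}, flagged={ratio > 10}")
```

Dividing the larger by the smaller makes the ratio symmetric, so it does not matter which pass produced the inflated value. Returning None for missing or zero amounts also avoids a division by zero.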

6. CUI Checksum Validation

Rule: Validate Romanian CIF Mod 11 checksum.

Implementation:

class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI
            digits = re.sub(r'\D', '', extraction.cui)

            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings

            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation."""
        if len(digits) < 2:
            return False

        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        digits_to_check = digits[:-1].zfill(9)

        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # Control digit is (weighted sum * 10) mod 11; a result of 10 maps to 0
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder

        return control == expected

Test Cases:

R010562600 → Checksum validation
R011201891 → Checksum validation
R012345678 → Warning (invalid checksum)
R01234 → Warning (too short)
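
The checksum test cases can be reproduced with a standalone function. This is a minimal sketch, assuming the commonly documented algorithm: weights 7,5,3,2,1,7,5,3,2 applied to the zero-padded body, control digit = (weighted sum × 10) mod 11, with a result of 10 mapped to 0. The function name `cui_checksum_ok` is hypothetical, and the length check here covers only what the checksum itself needs, not the 6-10 digit rule above.

```python
import re

WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)

def cui_checksum_ok(cui: str) -> bool:
    """Return True if the last digit matches the Mod 11 control digit."""
    digits = re.sub(r'\D', '', cui)          # drop the RO prefix and any separators
    if not 2 <= len(digits) <= 10:
        return False
    body, control = digits[:-1].zfill(9), int(digits[-1])
    weighted = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    expected = (weighted * 10) % 11
    if expected == 10:
        expected = 0
    return control == expected

for cui in ('RO10562600', 'RO11201891', 'RO12345678'):
    print(cui, cui_checksum_ok(cui))
```

For 10562600 the weighted sum of the body 001056260 is 78, and (78 × 10) mod 11 = 10, which maps to the control digit 0. Because the body is zero-padded, a leading zero left over from an OCR-misread prefix such as R0 does not change the result.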

7. Date Validity Validation

Rule: Date must not be in the future and must not be older than 10 years.

Implementation:

class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()

            # Check future date; allow 1 day of clock skew (see Q8)
            if extraction.receipt_date > today + timedelta(days=1):  # requires: from datetime import timedelta
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))

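            # Check too old (10 years); Feb 29 falls back to Feb 28 when the target year is not a leap year
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)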
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings

Test Cases:

2025-12-30 (today) → Valid
2025-10-11 → Valid
2026-01-01 → Warning (future)
2015-12-31 → Valid (one day inside the 10-year window)
2014-12-31 → Warning (too old)
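
Because the rule calls date.today(), these test cases are only reproducible when the reference date is pinned. A sketch of the same checks as a pure function with today passed in, applying the 1-day clock-skew tolerance from the Q8 answer; the names `check_receipt_date` and `years_before` are hypothetical:

```python
from datetime import date, timedelta
from typing import Optional

CLOCK_SKEW = timedelta(days=1)  # tolerance taken from the Q8 answer
MAX_AGE_YEARS = 10

def years_before(day: date, years: int) -> date:
    """Same calendar day `years` earlier; Feb 29 falls back to Feb 28."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)

def check_receipt_date(receipt_date: date, today: date) -> Optional[str]:
    """Return the name of the violated rule, or None if the date is acceptable."""
    if receipt_date > today + CLOCK_SKEW:
        return 'date_future'
    if receipt_date < years_before(today, MAX_AGE_YEARS):
        return 'date_too_old'
    return None

today = date(2025, 12, 30)
for d in (date(2025, 12, 30), date(2026, 1, 1), date(2015, 12, 31), date(2014, 12, 31)):
    print(d, check_receipt_date(d, today) or 'valid')
```

Injecting today keeps the unit tests deterministic. The `years_before` helper also covers the case where the check runs on Feb 29, when a plain `replace(year=...)` raises ValueError.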

Acceptance Criteria

Critical Success Criteria (Must Pass)

✅ AC-1: Five-Holding receipt extracts correct values

Given: Production PDF receipt (Five-Holding, 85.99 LEI)
When: OCR processes with new validation
Then:
- TOTAL = 85.99 LEI (NOT 859,762.16)
- TVA = 14.92 LEI (NOT 149,214.92)
- CUI = R010562600
- Overall confidence >= 90%

✅ AC-2: Save works with validation warnings

Given: Receipt with low confidence (75%)
When: User clicks Save
Then:
- Warnings displayed in UI
- Save button enabled
- Receipt saved with needs_manual_review=TRUE

✅ AC-3: Cross-validation: CARD + NUMERAR = TOTAL

Given: Receipt with CARD=50, NUMERAR=35.99
When: OCR extracts TOTAL=85.00 (off by 0.99, outside the ±0.02 tolerance)
Then:
- Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.00)"
- Suggested value: 85.99
- Auto-corrected if confidence < 80%

✅ AC-4: Cross-validation: Σ(TVA entries) = TVA TOTAL

Given: Receipt with TVA A=10.00, TVA B=4.92
When: OCR extracts TVA TOTAL=15.00 (off by 0.08, outside the ±0.02 tolerance)
Then:
- Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (15.00)"
- Auto-corrected to 14.92

✅ AC-5: CUI Mod 11 validation works

Given: Receipt with CUI R010562600
When: OCR processes
Then:
- CUI validated against Mod 11 checksum
- If invalid, warning displayed
- Format normalized to "RO" prefix

Secondary Criteria (Nice-to-Have)

🔲 AC-S1: Medium OCR performs better than Heavy

Given: 10 clear PDF receipts
When: Processed with Light → Medium → Tesseract
Then:
- No 10x magnitude errors
- Average confidence >= 90%
- Processing time < 5s

🔲 AC-S2: Validation warnings show in UI

Given: Receipt with 3 validation warnings
When: OCR completes
Then:
- Warning section displayed
- Each warning shows: field, message, severity
- Suggested values displayed if available

Testing Strategy

Unit Tests (~300 lines)

File: backend/modules/data_entry/tests/test_ocr_validation.py

Test Coverage:

# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()

# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()

# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()

# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()

# Date validation
test_date_valid()
test_date_future()
test_date_too_old()

# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()

# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()

Integration Tests (~200 lines)

File: backend/modules/data_entry/tests/test_ocr_validation_integration.py

Test Coverage:

# Real receipts
test_five_holding_receipt()           # Production case (85.99 not 859,762.16)
test_omv_receipt()                    # Clear PDF, Light OCR only
test_kaufland_receipt()               # Faded thermal, Medium OCR
test_mega_image_receipt()             # Multiple TVA entries

# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()

# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()

# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()

Manual Testing Checklist

Upload Five-Holding receipt PDF (production case)
- Verify TOTAL = 85.99 (not 859,762.16)
- Verify TVA = 14.92 (not 149,214.92)
- Verify no validation warnings
- Verify overall confidence >= 90%
Upload faded thermal receipt photo
- Verify Medium OCR used (not Heavy)
- Verify readable text extracted
- Verify no digit concatenation
Upload receipt with payment methods
- Verify CARD + NUMERAR displayed
- Verify sum matches TOTAL
- If mismatch, verify warning displayed
Upload receipt with multiple TVA entries
- Verify all TVA entries extracted
- Verify sum matches TVA TOTAL
- If mismatch, verify warning displayed
Submit receipt with warnings
- Verify Save button enabled
- Verify warnings displayed in UI
- Verify needs_manual_review flag set
Filter receipts by "Needs Review"
- Verify filter shows flagged receipts
- Verify supervisor can review

Risks and Mitigations

Risk	Likelihood	Impact	Mitigation
Medium OCR still causes errors	Medium	High	Keep Tesseract as Step 3 fallback; validation catches issues
CUI Mod 11 validation too strict	Medium	Low	Use warning (not error); allow override; some old CIFs don't have checksums
Validation rules too permissive	Low	Medium	Start conservative, tune based on production data
Validation rules too strict	Medium	Low	Non-blocking warnings allow user override
Performance impact	Low	Low	Validation is fast (<10ms); OCR dominates processing time
Breaking changes to API	Low	High	Add new fields, keep existing fields unchanged; frontend optional
Database migration issues	Low	Medium	Use NULL default (not FALSE); test on staging first

Out of Scope

Explicitly NOT included in this feature:

❌ Reprocessing existing receipts - Only new uploads validated
❌ Machine learning OCR improvements - Use existing PaddleOCR/Tesseract
❌ Custom OCR training - Generic models only
❌ Approval workflow changes - Validation is separate from approval
❌ Automatic approval - Always requires supervisor review
❌ Advanced validation rules - Only basic sanity checks
❌ Multi-currency support - RON only for now
❌ Historical receipt validation - Phase 2 feature
❌ OCR confidence tuning - Accept engine defaults
❌ Frontend validation logic - Backend only (frontend displays)

Open Questions

Q1: Should we keep Heavy preprocessing as fallback?

Answer: No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.

Q2: What tolerance for payment sum validation?

Answer: ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this absorbs rounding errors without hiding real mismatches.

Q3: Should CUI validation be blocking or warning?

Answer: Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.

Q4: What if Light OCR has high confidence but wrong values?

Answer: Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, amount_range rule flags it (>100,000 limit). User sees warning.

Q5: Should we reprocess existing receipts with new validation?

Answer: No. Too risky and time-consuming. Apply to new uploads only. If user wants to re-validate old receipt, they can re-upload.

Q6: What about receipts with no payment methods?

Answer: No validation warning. Not all receipts show CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.

Q7: Should validation auto-correct or just warn?

Answer: Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.

Q8: How to handle receipts from future (clock skew)?

Answer: Warning only (not error). Allow up to 1 day in future (±24h tolerance) for clock skew. Beyond that, warn user.

Estimated Complexity

Overall: High

Justification:

File Count: 6 modified, 3 created, 1 migration = 10 files
Line Changes: ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
Risk Level: Medium (core OCR pipeline changes, but validation is additive)
Testing: 42 new test cases (26 unit and 16 integration, as listed under Testing Strategy), manual testing required
Dependencies: None (uses existing OCR engines)
Complexity Factors:
- Multi-layer validation logic
- Romanian CIF checksum algorithm
- Cross-field validation dependencies
- Inter-OCR comparison logic
- Auto-correction logic
- Frontend integration
- Database migration

Estimated Effort: 2-3 days

Day 1: Validation engine + unit tests
Day 2: OCR pipeline integration + medium preprocessing
Day 3: Frontend integration + manual testing + bug fixes

Dependencies

External Libraries

✅ cv2 (OpenCV) - Already installed
✅ numpy - Already installed
✅ paddleocr - Already installed
✅ tesseract - Already installed
✅ pydantic - Already installed
✅ sqlalchemy - Already installed

Internal Modules

✅ backend/modules/data_entry/services/ocr_service.py
✅ backend/modules/data_entry/services/ocr_extractor.py
✅ backend/modules/data_entry/services/image_preprocessor.py
✅ backend/modules/data_entry/routers/ocr.py
✅ backend/modules/data_entry/schemas/ocr.py
✅ backend/modules/data_entry/db/models/receipt.py

Database Schema Changes

✅ Add needs_manual_review column to receipts table (nullable BOOLEAN)
✅ Alembic migration required
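
The effect of a nullable column on existing rows can be illustrated without Alembic. A minimal sketch using an in-memory SQLite database as a stand-in; the two-column `receipts` table here is a simplified assumption rather than the real schema, and the actual change would go through the Alembic migration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, amount TEXT)")
conn.execute("INSERT INTO receipts (amount) VALUES ('85.99')")  # row that predates the feature

# Nullable column with no default: existing rows get NULL ("never validated")
conn.execute("ALTER TABLE receipts ADD COLUMN needs_manual_review BOOLEAN")
conn.execute(
    "INSERT INTO receipts (amount, needs_manual_review) VALUES ('859762.16', 1)"
)

rows = conn.execute(
    "SELECT id, amount, needs_manual_review FROM receipts ORDER BY id"
).fetchall()
print(rows)
```

NULL keeps receipts that were never validated distinguishable from receipts that were validated and found clean (FALSE). That is also why the schema field is Optional[bool].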

Implementation Notes

Priority Order (Recommended)

Phase 1: Core Validation (Day 1)
- Create ocr/validation.py module
- Implement validation rules (amount, TVA, payment, CUI, date)
- Write unit tests
- Checkpoint: All tests pass
Phase 2: OCR Integration (Day 2 Morning)
- Add preprocess_medium() to image_preprocessor
- Update _merge_extractions() with validation-aware logic
- Remove/deprecate preprocess_heavy()
- Checkpoint: Five-Holding receipt extracts correctly
Phase 3: API Updates (Day 2 Afternoon)
- Update ExtractionResult dataclass with validation fields
- Update API schemas (ocr.py, routers/ocr.py)
- Add database migration
- Checkpoint: API returns validation warnings
Phase 4: Integration Testing (Day 3 Morning)
- Write integration tests
- Test with real receipts (Five-Holding, OMV, Kaufland)
- Checkpoint: All integration tests pass
Phase 5: Frontend & Polish (Day 3 Afternoon)
- Update Vue components to display warnings
- Add "Needs Review" filter
- Manual testing
- Bug fixes
- Checkpoint: Production-ready

Code Quality Standards

✅ Type hints for all functions
✅ Docstrings for all public methods
✅ Unit test coverage >90%
✅ Integration tests for critical paths
✅ Print statements for debugging (will be converted to logging later)
✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)

Performance Considerations

Validation overhead: <10ms per receipt (negligible vs. OCR time)
Medium preprocessing: Similar speed to Heavy (~500ms)
Database migration: Non-blocking (adds NULL column)
Frontend impact: Minimal (only displays warnings)

Related Documentation

Project Context

CLAUDE.md: Data Entry module instructions
docs/data-entry/DATA-ENTRY-MODULE.md: Module architecture
docs/ARCHITECTURE-DECISIONS.md: Ultrathin monolith rationale

Technical References

Romanian CIF validation: https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
OpenCV preprocessing: https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
PaddleOCR docs: https://github.com/PaddlePaddle/PaddleOCR

Similar Features

Payment methods extraction: Already implemented in ocr_extractor.py:1361
TVA entries extraction: Already implemented in ocr_extractor.py:820
Cross-validation logic: Pattern from _cross_validate_and_calculate_amount (lines 468-557)

Summary

This specification provides a comprehensive solution to fix critical OCR data extraction issues in the Data Entry module. The multi-layer validation system ensures data integrity while maintaining user flexibility through non-blocking warnings.

Key Benefits:

✅ Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
✅ Validates cross-field dependencies (payment sum, TVA sum)
✅ Improves CUI extraction with Mod 11 checksum
✅ Replaces problematic Heavy OCR with Medium preprocessing
✅ Non-blocking warnings preserve user workflow
✅ Manual review flag helps supervisors prioritize

Next Steps:

Review and approve specification
Create feature branch: feature/bon-ocr-validation
Implement Phase 1 (validation engine)
Continue with Phases 2-5
Deploy to staging for testing
Monitor production for 1 week before full rollout

Document Version: 1.0
Last Updated: 2025-12-30
Status: Ready for Implementation
Estimated Completion: 2026-01-02 (3 working days)
