# Feature Specification: OCR Data Extraction Validation System

**Feature ID:** bon-ocr-validation
**Priority:** Critical (P0 - Production Bug)
**Complexity:** High
**Estimated Effort:** 2-3 days
**Created:** 2025-12-30
**Module:** Data Entry (backend/modules/data_entry/)

---

## Overview

Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.

**Value Proposition:** Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.

---

## Problem Statement

### Current Behavior (BROKEN)

The OCR processing pipeline (`backend/modules/data_entry/services/ocr_service.py`) uses a 3-step adaptive approach:
1. **Step 1:** PaddleOCR + Light preprocessing (fast, high confidence)
2. **Step 2:** PaddleOCR + Heavy preprocessing (faded receipts)
3. **Step 3:** Tesseract (complement missing fields only)

**Critical Bug:** The `_merge_extractions()` method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.

### Real Production Example (Five-Holding Receipt)

| Field | Light OCR (98%) ✅ | Heavy OCR (88%) ❌ | Final Result ❌ |
|-------|-------------------|-------------------|-----------------|
| **TOTAL** | 85.99 LEI | 859,762.16 LEI | **859,762.16** (10,000x error!) |
| **TVA** | 14.92 LEI | 149,214.92 LEI | **149,214.92** (10,000x error!) |
| **CUI** | R010562600 | (not found) | R010562600 |
| **Date** | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| **Confidence** | 98% | 88% | 88% (wrong source!) |

**Root Cause:** Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in `image_preprocessor.py`) merge adjacent numbers, creating garbage values.

### Impact

- **Data Integrity:** Incorrect amounts enter accounting system
- **User Trust:** Users lose confidence in OCR accuracy
- **Manual Work:** Requires manual verification of ALL OCR extractions
- **Financial Risk:** Wrong amounts could be approved without review

---

## User Stories

### 1. As a user uploading a clear PDF receipt
**I want** OCR to extract correct values from the first pass
**So that** I don't have to manually correct obvious errors

**Acceptance Criteria:**
- Light OCR correctly extracts 85.99 LEI (not 859,762.16)
- Heavy OCR is skipped when Light OCR confidence >= 90%
- No 10,000x magnitude errors

### 2. As a user submitting a receipt with warnings
**I want** to be able to save receipts with validation warnings
**So that** I can submit for review even if OCR isn't perfect

**Acceptance Criteria:**
- Save button works with warnings (not blocked)
- Receipt marked with `needs_manual_review=True`
- Warnings displayed clearly in UI

### 3. As a supervisor reviewing receipts
**I want** to see which receipts need manual review
**So that** I can prioritize validation efforts

**Acceptance Criteria:**
- Filter by "Needs Review" flag
- Validation warnings shown in detail view
- Clear indication of which fields are suspicious

### 4. As a system validating cross-field data
**I want** to validate CARD + NUMERAR = TOTAL
**So that** payment methods match the total amount

**Acceptance Criteria:**
- Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
- If mismatch, flag for review
- Auto-correct TOTAL from payment sum if confidence < 80%

### 5. As a system validating TVA entries
**I want** to validate Σ(TVA entries) = TVA TOTAL
**So that** individual TVA lines match the total TVA

**Acceptance Criteria:**
- Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
- TVA rate validation (5-24% of TOTAL)
- If mismatch, flag for review

---

## Functional Requirements

### Core Requirements (Must-Have)

#### 1. Multi-Layer Validation Pipeline

**FR-1.1:** Absolute value sanity checks
- Amount range: 0.01 - 100,000 RON
- Max 2 decimal places
- Date: not in future, not older than 10 years (2015+)
- CUI: 6-10 digits, valid Mod 11 checksum
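
The date window above can be sketched as a small helper. This is a minimal illustration, not existing code: `is_plausible_receipt_date` and its parameters are hypothetical names, and the cut-off is assumed to be exactly ten years before the current date (rather than 1 January 2015).

```python
from datetime import date
from typing import Optional


def is_plausible_receipt_date(
    receipt_date: date,
    today: Optional[date] = None,
    max_age_years: int = 10,
) -> bool:
    """Return True if the date is not in the future and not older than max_age_years."""
    today = today or date.today()
    try:
        oldest_allowed = today.replace(year=today.year - max_age_years)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        oldest_allowed = today.replace(year=today.year - max_age_years, day=28)
    return oldest_allowed <= receipt_date <= today
```

Passing `today` explicitly keeps the check deterministic in unit tests.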

**FR-1.2:** Cross-field correlation validation
- TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
- Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
- Inter-OCR consistency: flag if values differ >10x between engines
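
A minimal sketch of the inter-OCR consistency check, assuming both engines return `Decimal` amounts or `None`. The helper names `inter_ocr_ratio` and `is_inconsistent` are illustrative and do not exist in the codebase.

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(a: Optional[Decimal], b: Optional[Decimal]) -> Optional[Decimal]:
    """Return max/min of two positive amounts, or None if they cannot be compared."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None
    return max(a, b) / min(a, b)


def is_inconsistent(
    a: Optional[Decimal],
    b: Optional[Decimal],
    threshold: Decimal = Decimal("10"),
) -> bool:
    """Flag two readings whose magnitudes differ by more than `threshold` times."""
    ratio = inter_ocr_ratio(a, b)
    return ratio is not None and ratio > threshold
```

For the Five-Holding receipt, `inter_ocr_ratio(Decimal("85.99"), Decimal("859762.16"))` is just under 10,000, far above the 10x threshold.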

**FR-1.3:** Auto-correction logic
- If TOTAL is obviously wrong (>10x payment sum), use payment sum
- If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
- Preserve high-confidence values from Light OCR over low-confidence Heavy OCR
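
The "reverse formula" is not spelled out in this spec. One plausible reading, assuming TOTAL is VAT-inclusive and the whole receipt is taxed at a single known rate, is `TOTAL = TVA × (100 + rate) / rate`. The function name `total_from_tva` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def total_from_tva(tva: Decimal, rate_pct: Decimal) -> Decimal:
    """Recover a VAT-inclusive TOTAL from the VAT amount at a single rate (in percent)."""
    if tva <= 0 or rate_pct <= 0:
        raise ValueError("tva and rate_pct must be positive")
    total = tva * (Decimal("100") + rate_pct) / rate_pct
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

With TVA 14.92 at 21% this yields 85.97, two bani away from the 85.99 printed on the production receipt, so the reconstructed value is better treated as a suggestion than as an exact replacement. Receipts with mixed rates would need per-rate TVA entries instead.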

**FR-1.4:** Validation result structure
```python
@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]  # Non-blocking issues
    errors: List[ValidationError]      # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]   # Auto-corrected values
    needs_manual_review: bool          # Flag for supervisor
```

#### 2. Replace Heavy with Medium OCR

**FR-2.1:** Remove `preprocess_heavy()` from the OCR pipeline (the method stays in `image_preprocessor.py`, marked deprecated)
- Current Heavy: aggressive binarization causes digit concatenation
- Reason: Destroys high-quality PDFs while trying to recover faded receipts

**FR-2.2:** Add `preprocess_medium()` method
- Moderate contrast enhancement (CLAHE clipLimit=2.0)
- Light denoising (fastNlMeansDenoising h=6)
- NO binarization, NO morphological operations
- Preserve text boundaries on clear images

**FR-2.3:** Update OCR pipeline
- Step 1: Light preprocessing (unchanged)
- Step 2: **Medium** preprocessing (replaces Heavy)
- Step 3: Tesseract (unchanged)

#### 3. Enhanced CUI Extraction

**FR-3.1:** Romanian CIF validation algorithm
- Implement Mod 11 checksum validation
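- Control digit formula: `(sum(digit[i] * weight[i]) * 10) % 11`, with a result of 10 mapped to 0
- Weights: `[7, 5, 3, 2, 1, 7, 5, 3, 2]`, right-aligned against the digits before the control digit (left-pad those digits with zeros to 9)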

**FR-3.2:** CUI format normalization
- Always add "RO" prefix if missing
- Remove spaces, dashes, dots
- Validate length: 6-10 digits

**FR-3.3:** Improved regex patterns
```python
# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',  # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',   # C. I. F. (spaced)
]
```
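
A short usage sketch for the patterns above. The patterns are copied verbatim; the `extract_cui` helper and the sample strings are illustrative assumptions, not part of the existing extractor.

```python
import re
from typing import Optional

CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',  # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',   # C. I. F. (spaced)
]


def extract_cui(text: str) -> Optional[str]:
    """Return the first CUI candidate as a digit string, or None if nothing matches."""
    for pattern in CUI_OCR_TOLERANT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            # The capture group may contain spaces; keep digits only
            return re.sub(r'\s', '', match.group(1))
    return None


print(extract_cui("CIF: RO 10 562 600"))   # spaced digits after an RO prefix
print(extract_cui("C1F 10562600"))         # I misread as 1
print(extract_cui("C. I. F.: 10562600"))   # dotted abbreviation
```

Each sample is matched by a different pattern and normalizes to the same digit string, which can then go through length and checksum validation.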

#### 4. User Requirements Integration

**FR-4.1:** Non-blocking validation warnings
- Save button enabled even with warnings
- User can override and submit
- Warnings displayed clearly in UI

**FR-4.2:** Manual review flag
- Database field: `receipts.needs_manual_review` (BOOLEAN)
- Set to `TRUE` if:
  - Any validation warning present
  - Overall confidence < 85%
  - Cross-validation fails

**FR-4.3:** Apply to new uploads only
- No reprocessing of existing receipts
- Validation runs on OCR extraction (POST /api/ocr/extract)
- Migration: add column with default NULL (not FALSE)
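
Because the column is nullable, consumers have to handle three states rather than two. A minimal sketch, with hypothetical status labels:

```python
from typing import Optional


def review_status(needs_manual_review: Optional[bool]) -> str:
    """Map the nullable flag to a display status."""
    if needs_manual_review is None:
        return "not_validated"  # receipt predates the validation feature
    return "needs_review" if needs_manual_review else "validated_ok"
```

A "Needs Review" filter should match only `TRUE`, so that legacy receipts with `NULL` do not flood the supervisor queue.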

### Secondary Requirements (Nice-to-Have)

**FR-S1:** Validation confidence scoring
- Each validation rule contributes to score
- Overall validation confidence: weighted average
- Display in UI alongside OCR confidence
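
One way to read "weighted average" here is the weight of the passed rules divided by the weight of all evaluated rules. The sketch below assumes that reading; the function name, the rule names used as keys, and the weights are placeholders.

```python
from typing import Dict


def validation_confidence(rule_passed: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Weighted share of passed rules, in [0, 1]; rules without a weight count as 1.0."""
    total_weight = sum(weights.get(name, 1.0) for name in rule_passed)
    if total_weight == 0:
        return 1.0  # nothing was evaluated, so nothing failed
    passed_weight = sum(
        weights.get(name, 1.0) for name, ok in rule_passed.items() if ok
    )
    return passed_weight / total_weight


score = validation_confidence(
    {"amount_range": False, "tva_ratio": True, "cui_checksum": True},
    {"amount_range": 3.0, "tva_ratio": 1.0, "cui_checksum": 1.0},
)
```

In this example a failed `amount_range` rule with weight 3 out of a total weight of 5 gives a score of 0.4.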

**FR-S2:** Validation rule configurability
- Move hardcoded thresholds to config
- Allow per-company customization
- Admin UI to adjust rules
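
A possible shape for the configurable thresholds, using the defaults stated in this spec. `ValidationThresholds`, `COMPANY_OVERRIDES`, `thresholds_for`, and the company ID `42` are hypothetical; in practice the overrides would come from a config file or database table.

```python
from dataclasses import dataclass, replace
from decimal import Decimal
from typing import Any, Dict


@dataclass(frozen=True)
class ValidationThresholds:
    amount_min: Decimal = Decimal("0.01")
    amount_max: Decimal = Decimal("100000")
    tva_ratio_min_pct: Decimal = Decimal("5")
    tva_ratio_max_pct: Decimal = Decimal("24")
    sum_tolerance: Decimal = Decimal("0.02")
    inter_ocr_max_ratio: Decimal = Decimal("10")
    review_confidence: float = 0.85


DEFAULT_THRESHOLDS = ValidationThresholds()

# Hypothetical per-company overrides, keyed by company ID
COMPANY_OVERRIDES: Dict[int, Dict[str, Any]] = {
    42: {"amount_max": Decimal("500000")},
}


def thresholds_for(company_id: int) -> ValidationThresholds:
    """Return the defaults with any company-specific overrides applied."""
    return replace(DEFAULT_THRESHOLDS, **COMPANY_OVERRIDES.get(company_id, {}))
```

A frozen dataclass keeps the defaults immutable, and `dataclasses.replace` rejects unknown field names, which catches typos in override keys early.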

---

## Technical Requirements

### Files to Create

#### 1. `backend/modules/data_entry/services/ocr/validation.py`
**Purpose:** Validation utilities and rule engine
**Size:** ~400 lines
**Key Classes:**
- `ValidationRule` (base class)
- `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
- `OCRValidationEngine` (orchestrator)

**Example:**
```python
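from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None

class ValidationRule(ABC):
    """Base validation rule."""
    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return field corrections; rules without auto-correction return none."""
        return {}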

class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # 0.00 is falsy but must still be range-checked
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings

class OCRValidationEngine:
    """Orchestrate all validation rules."""
    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}

        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)

            # Apply auto-corrections
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )

        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )
```

#### 2. `backend/modules/data_entry/tests/test_ocr_validation.py`
**Purpose:** Unit tests for validation rules
**Size:** ~300 lines
**Coverage Target:** >90%

**Test Cases:**
- `test_amount_range_valid()` - 85.99 RON passes
- `test_amount_range_too_high()` - 859,762.16 fails
- `test_tva_ratio_valid()` - 14.92/85.99 ≈ 17.4% passes
- `test_tva_ratio_too_high()` - 149,214.92/859,762.16 ≈ 17.4%, so the ratio alone passes; the garbage TOTAL must be caught by `amount_range`
- `test_payment_sum_matches()` - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
- `test_cui_checksum_valid()` - R010562600 passes Mod 11
- `test_cui_checksum_invalid()` - R010562601 fails Mod 11
- `test_inter_ocr_consistency()` - 85.99 vs 859,762.16 = 10,000x flag

#### 3. `backend/modules/data_entry/tests/test_ocr_validation_integration.py`
**Purpose:** Integration tests with full OCR pipeline
**Size:** ~200 lines

**Test Cases:**
- `test_five_holding_receipt()` - Real production case (85.99 not 859,762.16)
- `test_clear_pdf_uses_light_ocr()` - High-quality PDF skips Step 2 (Medium, formerly Heavy)
- `test_faded_receipt_uses_medium_ocr()` - Thermal receipt uses Medium
- `test_validation_warnings_in_response()` - API returns warnings
- `test_manual_review_flag_set()` - Flag set when confidence < 85%

### Files to Modify

#### 1. `backend/modules/data_entry/services/ocr_service.py`
**Changes:** ~200 lines modified, ~100 lines added

**Key Modifications:**

**A. Replace `_merge_extractions()` (lines 240-386) with validation-aware version:**
```python
def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    validator = OCRValidationEngine()

    # Validate both extractions
    light_validation = validator.validate(light) if light else None
    medium_validation = validator.validate(medium) if medium else None

    result = ExtractionResult()

    # === AMOUNT (with validation check) ===
    if light and medium and light.amount and medium.amount:
        # Check for 10x inconsistency
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)
            # Prefer the value with fewer amount-related warnings
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']

            # On a tie, keep the higher-confidence value (FR-1.3)
            if len(light_warnings) != len(medium_warnings):
                use_light = len(light_warnings) < len(medium_warnings)
            else:
                use_light = light.confidence_amount >= medium.confidence_amount

            if use_light:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
                print(f"[Merge] Using Light OCR amount: {light.amount}", flush=True)
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
                print(f"[Merge] Using Medium OCR amount: {medium.amount}", flush=True)
        else:
            # Normal merge: prefer higher confidence
            if light.confidence_amount >= medium.confidence_amount:
                result.amount = light.amount
                result.confidence_amount = light.confidence_amount
            else:
                result.amount = medium.amount
                result.confidence_amount = medium.confidence_amount
    elif light and light.amount:
        # Only Light produced an amount (Medium may be None)
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium and medium.amount:
        # Only Medium produced an amount (Light may be None)
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)

    return result
```

**B. Add `preprocess_medium()` call (replace Heavy):**
```python
# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)
medium_img = self.preprocessor.preprocess_medium(image)  # NEW

try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing
```

**C. Add validation to final result:**
```python
# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review
```

#### 2. `backend/modules/data_entry/services/ocr_extractor.py`
**Changes:** ~50 lines modified, ~30 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionResult` (lines 10-50):**
```python
@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List["ValidationWarning"] = field(default_factory=list)  # Warning objects from the validation engine
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'
```

**B. Fix CLIENT CUI patterns (lines 253-272):**
```python
# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...

    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                      # CUI on next line
]
```

**C. Add CUI normalization and validation:**
```python
def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

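    # Drop an optional RO prefix first (OCR may read "O" as "0"), then remove non-digits
    cui = re.sub(r'^\s*R[O0]', '', cui.strip(), flags=re.IGNORECASE)
    digits = re.sub(r'\D', '', cui)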

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False

    # Test key 7, 5, 3, 2, 1, 7, 5, 3, 2, applied to the 9-digit zero-padded body
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Weighted sum over all digits except the last
    digits_to_check = digits[:-1].zfill(9)  # Left-pad with zeros to align with the key
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10 before taking the remainder; a result of 10 maps to 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder

    return control == expected_control
```

#### 3. `backend/modules/data_entry/services/image_preprocessor.py`
**Changes:** ~80 lines added

**Key Modifications:**

**A. Add `preprocess_medium()` method (after line 166):**
```python
def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.
    Balance between Light (too gentle) and Heavy (too aggressive).

    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
        height, width = gray.shape

    if width < 1500:
        scale = 1500 / width
        new_width = int(width * scale)
        new_height = int(height * scale)
        if max(new_width, new_height) > 4000:
            # Cap the upscale factor (relative to the current size) so the longest side ends at 4000 px
            scale = 4000 / max(width, height)
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened
```

**B. Mark `preprocess_heavy()` as deprecated:**
```python
def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)
```

#### 4. `backend/modules/data_entry/routers/ocr.py`
**Changes:** ~40 lines modified

**Key Modifications:**

**A. Update `ExtractionData` schema instantiation (lines 106-128):**
```python
# Add validation warnings to response
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in result.validation_warnings
] if hasattr(result, 'validation_warnings') else []

data = ExtractionData(
    # ... existing fields ...

    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)
```

#### 5. `backend/modules/data_entry/schemas/ocr.py`
**Changes:** ~20 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionData` (after line 57):**
```python
class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")

class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default=[], description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")
```

#### 6. Database Migration: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`
**Purpose:** Add `needs_manual_review` column to `receipts` table
**Size:** ~30 lines (Alembic migration)

```python
"""Add needs_manual_review flag to receipts

Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None

def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))

def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
```

### Frontend Integration Points

#### 1. `src/modules/data-entry/views/receipts/ReceiptCreateView.vue`
**Changes:** Display validation warnings below OCR results

**Example:**
```vue
<template>
  <div class="ocr-results">
    <!-- Existing OCR fields -->

    <!-- NEW: Validation warnings section -->
    <div v-if="ocrData.validation_warnings?.length > 0" class="validation-warnings">
      <h4>
        <i class="pi pi-exclamation-triangle" />
        Avertismente Validare ({{ ocrData.validation_warnings.length }})
      </h4>
      <ul>
        <li
          v-for="(warning, idx) in ocrData.validation_warnings"
          :key="idx"
          :class="`severity-${warning.severity}`"
        >
          <strong>{{ warning.field }}:</strong> {{ warning.message }}
          <span v-if="warning.suggested_value" class="suggestion">
            (sugestie: {{ warning.suggested_value }})
          </span>
        </li>
      </ul>
    </div>

    <!-- NEW: Manual review badge -->
    <div v-if="ocrData.needs_manual_review" class="manual-review-badge">
      <i class="pi pi-flag" />
      Necesită verificare manuală
    </div>
  </div>
</template>

<style scoped>
.validation-warnings {
  margin-top: 1rem;
  padding: 1rem;
  background: #fff3cd;
  border-left: 4px solid #ffc107;
}

.validation-warnings li.severity-low {
  color: #666;
}

.validation-warnings li.severity-medium {
  color: #f57c00;
}

.validation-warnings li.severity-high {
  color: #d32f2f;
  font-weight: bold;
}

.manual-review-badge {
  margin-top: 0.5rem;
  padding: 0.5rem 1rem;
  background: #fff3cd;
  border-radius: 4px;
  display: inline-flex;
  align-items: center;
  gap: 0.5rem;
}
</style>
```

#### 2. `src/modules/data-entry/components/ocr/OCRPreview.vue`
**Changes:** Add inter-OCR consistency indicator

**Example:**
```vue
<template>
  <div class="ocr-preview">
    <!-- Existing fields -->

    <!-- NEW: Inter-OCR consistency warning -->
    <div v-if="ocrData.inter_ocr_ratio && ocrData.inter_ocr_ratio > 10" class="ocr-consistency-warning">
      <i class="pi pi-exclamation-circle" />
      Inconsistență detectată între motoarele OCR ({{ Math.round(ocrData.inter_ocr_ratio) }}x diferență).
      <br />
      <small>Valorile folosite provin din: {{ ocrData.inter_ocr_source_used }}</small>
    </div>
  </div>
</template>
```

---

## Design Decisions

### 1. Why Validation Warnings Instead of Errors?

**Decision:** Use non-blocking warnings instead of blocking errors.

**Rationale:**
- User requirement: "Allow save with warnings"
- OCR will never be 100% perfect
- Users can override incorrect extractions
- Supervisor review catches issues before approval

**Trade-off:** Risk of bad data entering system vs. user frustration with blocked submissions.

**Mitigation:** Manual review flag ensures supervisor catches issues.

### 2. Why Replace Heavy with Medium OCR?

**Decision:** Remove Heavy preprocessing, add Medium preprocessing.

**Rationale:**
- **Heavy causes digit concatenation** on clear PDFs (production evidence)
- Binarization destroys text boundaries on high-quality images
- Morphological operations merge adjacent numbers (85.99 → 859,762.16)

**Analysis of Heavy Preprocessing (lines 153-164 in `image_preprocessor.py`):**
```python
# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
```

**Alternative Considered:** Keep Heavy but add safeguards. **Rejected:** Too risky, no benefit for clear PDFs.

### 3. Why Romanian CIF Mod 11 Validation?

**Decision:** Implement CIF checksum validation algorithm.

**Rationale:**
- Romanian CIFs have built-in checksum (last digit)
- Validates extracted CUI is mathematically correct
- Catches OCR digit errors (10562600 vs 10562601)

**Algorithm:** Mod 11 checksum
- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], right-aligned against the digits before the control digit (left-pad those digits with zeros to 9)
- Formula: `(sum(digit[i] * weight[i]) * 10) % 11`
- Control digit: the result (0 if the result is 10)

**Example:** RO10562600
- Body: 1056260 → padded to 0,0,1,0,5,6,2,6,0; control digit: [0]
- Weighted sum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
- (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → **VALID**

**Note:** Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).
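
A minimal standalone sketch of the check, assuming the commonly published form of the algorithm (weighted sum multiplied by 10 before taking the remainder modulo 11). The names `CIF_KEY`, `cif_control_digit`, and `is_valid_cif` are illustrative, and the 2-10 digit length bound is an assumption that is looser than the 6-10 digits required by FR-3.2.

```python
CIF_KEY = (7, 5, 3, 2, 1, 7, 5, 3, 2)


def cif_control_digit(body: str) -> int:
    """Compute the control digit for a CIF body (all digits except the last)."""
    padded = body.zfill(9)  # right-align the body against the 9-digit key
    weighted_sum = sum(int(d) * w for d, w in zip(padded, CIF_KEY))
    result = (weighted_sum * 10) % 11
    return 0 if result == 10 else result


def is_valid_cif(digits: str) -> bool:
    """Check a digit-only CIF (no RO prefix) against its control digit."""
    if not (digits.isascii() and digits.isdigit()) or not (2 <= len(digits) <= 10):
        return False
    return cif_control_digit(digits[:-1]) == int(digits[-1])


print(is_valid_cif("10562600"))  # CUI from the production receipt
print(is_valid_cif("10562601"))  # same CUI with a misread last digit
```

Under this form of the algorithm the production CUI validates and the single-digit misread does not, which is the behavior the unit tests `test_cui_checksum_valid()` and `test_cui_checksum_invalid()` expect.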

### 4. Why Apply to New Uploads Only?

**Decision:** Don't reprocess existing receipts.

**Rationale:**
- Migration impact: ~500 existing receipts in DB
- Reprocessing cost: OCR is slow (~2-5s per receipt)
- Risk: May change existing approved data
- Benefit: Minimal (old receipts already reviewed)

**Implementation:** Migration adds column with default NULL (not FALSE).

---

## Validation Rules Specification

### 1. Amount Range Validation

**Rule:** Amount must be between 0.01 and 100,000 RON.

**Implementation:**
```python
class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.amount is not None:  # 0.00 is falsy but must still produce a warning
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))

            # Check decimal places (a non-negative exponent means no fractional digits)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings
```

**Test Cases:**
- 0.00 RON → Warning (too small)
- 0.01 RON → Valid
- 85.99 RON → Valid
- 100,000 RON → Valid
- 100,001 RON → Warning (too large)
- 859,762.16 RON → Warning (too large)
- 85.999 RON → Warning (too many decimals)

### 2. TVA Ratio Validation

**Rule:** TVA must be 5-24% of TOTAL amount.

**Implementation:**
```python
class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings
```

**Test Cases:**
- TVA=14.92, TOTAL=85.99 → ≈17.4% → Valid
- TVA=149,214.92, TOTAL=859,762.16 → ≈17.4% → Ratio passes although both values are wrong (TOTAL is caught by amount_range)
- TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
- TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)
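
To relate the 5-24% window to the statutory rates, note that when TOTAL is VAT-inclusive, a rate of `r` percent makes TVA equal to `r / (100 + r)` of TOTAL. The sketch below assumes a single rate per receipt and a VAT-inclusive TOTAL; `tva_share_of_gross` is a hypothetical helper.

```python
from decimal import Decimal

RATES_PCT = [Decimal("5"), Decimal("9"), Decimal("11"), Decimal("19"), Decimal("21")]


def tva_share_of_gross(rate_pct: Decimal) -> Decimal:
    """Percentage of a VAT-inclusive total that is VAT, for a single rate."""
    share = rate_pct / (Decimal("100") + rate_pct) * Decimal("100")
    return share.quantize(Decimal("0.01"))


for rate in RATES_PCT:
    print(f"{rate}% rate -> TVA is {tva_share_of_gross(rate)}% of TOTAL")
```

Under these assumptions the shares range from about 4.76% (5% rate) to about 17.36% (21% rate). The 17.4% seen on the Five-Holding receipt matches the 21% rate, while a receipt taxed entirely at 5% would fall just below the 5% floor and raise `tva_ratio_low`, which may be worth accounting for when the threshold is finalized.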

### 3. Payment Sum Validation

**Rule:** CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).

**Implementation:**
```python
class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)

            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections
```

**Test Cases:**
- CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
- CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
- CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning
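
The comparison above only works if every amount is a `Decimal`: in Python, subtracting a `Decimal` from a `float` raises `TypeError`. A minimal sketch of normalizing amounts before the rule runs is shown below. The helper `to_decimal` and the `method` key are hypothetical; only the `amount` key appears in the rule.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional, Union

Number = Union[str, int, float, Decimal]


def to_decimal(value: Number) -> Optional[Decimal]:
    """Convert an extracted amount to a 2-place Decimal; None if unparsable."""
    try:
        # Going through str() avoids binary float residue, e.g. Decimal(35.97)
        return Decimal(str(value)).quantize(Decimal("0.01"))
    except (InvalidOperation, ValueError):
        return None


# Hypothetical payment entries with mixed int/float amounts
payments = [{"method": "CARD", "amount": 50}, {"method": "NUMERAR", "amount": 35.97}]
payment_sum = sum((to_decimal(p["amount"]) for p in payments), Decimal("0"))
difference = abs(payment_sum - to_decimal("85.99"))
print(payment_sum, difference, difference > Decimal("0.02"))
```

This reproduces the second test case: the difference is exactly 0.02, which is within tolerance because the rule uses a strict `>` comparison.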

### 4. TVA Entries Sum Validation

**Rule:** Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).

**Implementation:**
```python
class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)

            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections
```

**Test Cases:**
- Entries=[A:19%:14.92], TOTAL=14.92 → Valid
- Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
- Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
- Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning

### 5. Inter-OCR Consistency Validation

**Rule:** Flag if values differ >10x between OCR engines.

**Implementation:**
```python
class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """Check the ratio that the merge step stored on the extraction.

        The ratio itself is computed during merge, not in this rule.
        """
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings
```

**Test Cases:**
- Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
- Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
- Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
- Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!
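
The rule reads `inter_ocr_ratio` but does not show how the merge step might compute it. One possible sketch, assuming the ratio is defined as the larger value divided by the smaller so that the order of the two passes does not matter (the function name and this definition are assumptions, not taken from the existing pipeline):

```python
from decimal import Decimal
from typing import Optional


def inter_ocr_ratio(a: Optional[Decimal], b: Optional[Decimal]) -> Optional[Decimal]:
    """Return max/min of two positive amounts (always >= 1), or None if not comparable."""
    if a is None or b is None or a <= 0 or b <= 0:
        return None
    return max(a, b) / min(a, b)


light = Decimal("85.99")
for medium in (Decimal("86.00"), Decimal("859.76"), Decimal("859762.16")):
    ratio = inter_ocr_ratio(light, medium)
    print(medium, f"{ratio:.2f}", "WARN" if ratio > 10 else "ok")
```

Returning `None` when either value is missing or non-positive keeps the rule silent for fields that only one pass extracted, which matches the truthiness check in `validate()`.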

### 6. CUI Checksum Validation

**Rule:** Validate Romanian CIF Mod 11 checksum.

**Implementation:**
```python
class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI
            digits = re.sub(r'\D', '', extraction.cui)

            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings

            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

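    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation.

        The last digit is the control digit. The remaining digits are
        left-padded to 9, multiplied position-wise by the key 753217532,
        and the weighted sum times 10 is reduced modulo 11 (10 maps to 0).
        """
        if len(digits) < 2:
            return False

        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        digits_to_check = digits[:-1].zfill(9)

        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # Multiply by 10 before the modulo; omitting this rejects valid CUIs
        # such as 10562600 (weighted sum 78 -> 780 % 11 = 10 -> control 0).
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder

        return control == expected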
```

**Test Cases:**
- R010562600 → Valid (digits 10562600, control digit 0 matches)
- R011201891 → Valid (digits 11201891, control digit 1 matches)
- R012345678 → Warning (invalid checksum)
- R01234 → Warning (too short)

### 7. Date Validity Validation

**Rule:** Date must not be in the future or more than 10 years old.

**Implementation:**
```python
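from datetime import date, timedelta


class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()

            # Check future date (1-day tolerance for clock skew, see Q8)
            if extraction.receipt_date > today + timedelta(days=1):
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))

            # Check too old (10 years); Feb 29 has no counterpart in a
            # non-leap year, so fall back to Feb 28
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                cutoff_date = today.replace(year=today.year - 10, day=28)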
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings
```

**Test Cases:**
- 2025-12-30 (today) → Valid
- 2025-10-11 → Valid
- 2026-01-01 → Warning (future)
- 2015-12-30 → Valid (exactly 10 years ago)
- 2014-12-31 → Warning (too old)

---

## Acceptance Criteria

### Critical Success Criteria (Must Pass)

✅ **AC-1:** Five-Holding receipt extracts correct values
- **Given:** Production PDF receipt (Five-Holding, 85.99 LEI)
- **When:** OCR processes with new validation
- **Then:**
  - TOTAL = 85.99 LEI (NOT 859,762.16)
  - TVA = 14.92 LEI (NOT 149,214.92)
  - CUI = R010562600
  - Overall confidence >= 90%

✅ **AC-2:** Save works with validation warnings
- **Given:** Receipt with low confidence (75%)
- **When:** User clicks Save
- **Then:**
  - Warnings displayed in UI
  - Save button enabled
  - Receipt saved with `needs_manual_review=TRUE`

✅ **AC-3:** Cross-validation: CARD + NUMERAR = TOTAL
- **Given:** Receipt with CARD=50, NUMERAR=35.99
- **When:** OCR extracts TOTAL=85.89 (off by 0.10, outside the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "Payment methods sum (85.99) ≠ TOTAL (85.89)"
  - Suggested value: 85.99
  - Auto-corrected if confidence < 80%

✅ **AC-4:** Cross-validation: Σ(TVA entries) = TVA TOTAL
- **Given:** Receipt with TVA A=10.00, TVA B=4.92
- **When:** OCR extracts TVA TOTAL=14.82 (off by 0.10, outside the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.82)"
  - Auto-corrected to 14.92

✅ **AC-5:** CUI Mod 11 validation works
- **Given:** Receipt with CUI R010562600
- **When:** OCR processes
- **Then:**
  - CUI validated against Mod 11 checksum
  - If invalid, warning displayed
  - Format normalized to "RO" prefix
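
The normalization step in AC-5 could look roughly like the sketch below. It assumes that real CUIs never start with 0, so a leading zero is treated as residue of an `RO` prefix misread as `R0` (as in `R010562600`). The function name `normalize_cui` is hypothetical.

```python
import re
from typing import Optional


def normalize_cui(raw: str) -> Optional[str]:
    """Return the CUI as 'RO' + digits, or None if no digits remain."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a leading zero comes from 'RO' misread as 'R0', so drop it
    digits = digits.lstrip("0")
    return f"RO{digits}" if digits else None


for raw in ("R010562600", "RO 10562600", "10562600", "cui: ro10562600"):
    print(repr(raw), "->", normalize_cui(raw))
```

All four inputs normalize to the same string, which lets the checksum rule and any duplicate detection work on one canonical form.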

### Secondary Criteria (Nice-to-Have)

🔲 **AC-S1:** Medium OCR performs better than Heavy
- **Given:** 10 clear PDF receipts
- **When:** Processed with Light → Medium → Tesseract
- **Then:**
  - No 10x magnitude errors
  - Average confidence >= 90%
  - Processing time < 5s

🔲 **AC-S2:** Validation warnings show in UI
- **Given:** Receipt with 3 validation warnings
- **When:** OCR completes
- **Then:**
  - Warning section displayed
  - Each warning shows: field, message, severity
  - Suggested values displayed if available

---

## Testing Strategy

### Unit Tests (~300 lines)

**File:** `backend/modules/data_entry/tests/test_ocr_validation.py`

**Test Coverage:**
```python
# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()

# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()

# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()

# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()

# Date validation
test_date_valid()
test_date_future()
test_date_too_old()

# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()

# Validation engine
test_validation_engine_no_warnings()
test_validation_engine_multiple_warnings()
test_validation_engine_auto_corrections()
test_needs_manual_review_flag()
```

### Integration Tests (~200 lines)

**File:** `backend/modules/data_entry/tests/test_ocr_validation_integration.py`

**Test Coverage:**
```python
# Real receipts
test_five_holding_receipt()           # Production case (85.99 not 859,762.16)
test_omv_receipt()                    # Clear PDF, Light OCR only
test_kaufland_receipt()               # Faded thermal, Medium OCR
test_mega_image_receipt()             # Multiple TVA entries

# OCR pipeline
test_light_ocr_high_confidence_skips_medium()
test_light_ocr_low_confidence_runs_medium()
test_medium_ocr_replaces_heavy()
test_validation_runs_after_merge()

# API responses
test_api_returns_validation_warnings()
test_api_returns_needs_manual_review_flag()
test_api_returns_inter_ocr_ratio()
test_api_auto_corrects_amount_from_payments()

# Edge cases
test_no_ocr_engines_available()
test_pdf_with_multiple_pages()
test_receipt_with_no_tva()
test_receipt_with_no_payment_methods()
```

### Manual Testing Checklist

1. **Upload Five-Holding receipt PDF** (production case)
   - [ ] Verify TOTAL = 85.99 (not 859,762.16)
   - [ ] Verify TVA = 14.92 (not 149,214.92)
   - [ ] Verify no validation warnings
   - [ ] Verify overall confidence >= 90%

2. **Upload faded thermal receipt photo**
   - [ ] Verify Medium OCR used (not Heavy)
   - [ ] Verify readable text extracted
   - [ ] Verify no digit concatenation

3. **Upload receipt with payment methods**
   - [ ] Verify CARD + NUMERAR displayed
   - [ ] Verify sum matches TOTAL
   - [ ] If mismatch, verify warning displayed

4. **Upload receipt with multiple TVA entries**
   - [ ] Verify all TVA entries extracted
   - [ ] Verify sum matches TVA TOTAL
   - [ ] If mismatch, verify warning displayed

5. **Submit receipt with warnings**
   - [ ] Verify Save button enabled
   - [ ] Verify warnings displayed in UI
   - [ ] Verify `needs_manual_review` flag set

6. **Filter receipts by "Needs Review"**
   - [ ] Verify filter shows flagged receipts
   - [ ] Verify supervisor can review

---

## Risks and Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Medium OCR still causes errors** | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues |
| **CUI Mod 11 validation too strict** | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums |
| **Validation rules too permissive** | Low | Medium | Start conservative, tune based on production data |
| **Validation rules too strict** | Medium | Low | Non-blocking warnings allow user override |
| **Performance impact** | Low | Low | Validation is fast (<10ms); OCR dominates processing time |
| **Breaking changes to API** | Low | High | Add new fields, keep existing fields unchanged; frontend optional |
| **Database migration issues** | Low | Medium | Use NULL default (not FALSE); test on staging first |

---

## Out of Scope

**Explicitly NOT included in this feature:**

1. ❌ **Reprocessing existing receipts** - Only new uploads validated
2. ❌ **Machine learning OCR improvements** - Use existing PaddleOCR/Tesseract
3. ❌ **Custom OCR training** - Generic models only
4. ❌ **Approval workflow changes** - Validation is separate from approval
5. ❌ **Automatic approval** - Always requires supervisor review
6. ❌ **Advanced validation rules** - Only basic sanity checks
7. ❌ **Multi-currency support** - RON only for now
8. ❌ **Historical receipt validation** - Phase 2 feature
9. ❌ **OCR confidence tuning** - Accept engine defaults
10. ❌ **Frontend validation logic** - Backend only (frontend displays)

---

## Open Questions

### Q1: Should we keep Heavy preprocessing as fallback?

**Answer:** No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better.

### Q2: What tolerance for payment sum validation?

**Answer:** ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this tolerance absorbs rounding errors.

### Q3: Should CUI validation be blocking or warning?

**Answer:** Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.

### Q4: What if Light OCR has high confidence but wrong values?

**Answer:** Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, amount_range rule flags it (>100,000 limit). User sees warning.

### Q5: Should we reprocess existing receipts with new validation?

**Answer:** No. Too risky and time-consuming. Apply to new uploads only. If user wants to re-validate old receipt, they can re-upload.

### Q6: What about receipts with no payment methods?

**Answer:** No validation warning. Not all receipts show CARD/NUMERAR breakdown (especially older thermal receipts). Only validate if payment methods are extracted.

### Q7: Should validation auto-correct or just warn?

**Answer:** Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.
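
The policy in this answer can be sketched as a small decision function. The thresholds mirror `PaymentSumRule` (±0.02 tolerance, 80% confidence); the `Action` enum and `decide` function are hypothetical names used only for illustration.

```python
from decimal import Decimal
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"              # values agree within tolerance
    WARN = "warn"                  # mismatch, but the extracted value is high-confidence
    AUTO_CORRECT = "auto_correct"  # mismatch and low confidence: replace and warn


TOLERANCE = Decimal("0.02")
CONFIDENCE_THRESHOLD = 0.80


def decide(extracted: Decimal, derived: Decimal, confidence: float) -> Action:
    """Pick an action for a field that can be cross-checked against a derived value."""
    if abs(extracted - derived) <= TOLERANCE:
        return Action.ACCEPT
    if confidence < CONFIDENCE_THRESHOLD:
        return Action.AUTO_CORRECT
    return Action.WARN


print(decide(Decimal("85.00"), Decimal("85.99"), confidence=0.75))
print(decide(Decimal("85.00"), Decimal("85.99"), confidence=0.98))
```

Keeping the decision in one place makes it easier to confirm that a high-confidence value is never replaced, only flagged.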

### Q8: How should receipts dated in the future (clock skew) be handled?

**Answer:** Warning only (not error). Allow up to 1 day in the future (+24h tolerance) for clock skew. Beyond that, warn the user.

---

## Estimated Complexity

**Overall:** High
**Justification:**

- **File Count:** 6 modified, 3 created, 1 migration = 10 files
- **Line Changes:** ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
- **Risk Level:** Medium (core OCR pipeline changes, but validation is additive)
- **Testing:** 42 new test functions (26 unit, 16 integration), manual testing required
- **Dependencies:** None (uses existing OCR engines)
- **Complexity Factors:**
  - Multi-layer validation logic
  - Romanian CIF checksum algorithm
  - Cross-field validation dependencies
  - Inter-OCR comparison logic
  - Auto-correction logic
  - Frontend integration
  - Database migration

**Estimated Effort:** 2-3 days
- Day 1: Validation engine + unit tests
- Day 2: OCR pipeline integration + medium preprocessing
- Day 3: Frontend integration + manual testing + bug fixes

---

## Dependencies

### External Libraries
- ✅ `cv2` (OpenCV) - Already installed
- ✅ `numpy` - Already installed
- ✅ `paddleocr` - Already installed
- ✅ `tesseract` - Already installed
- ✅ `pydantic` - Already installed
- ✅ `sqlalchemy` - Already installed

### Internal Modules
- ✅ `backend/modules/data_entry/services/ocr_service.py`
- ✅ `backend/modules/data_entry/services/ocr_extractor.py`
- ✅ `backend/modules/data_entry/services/image_preprocessor.py`
- ✅ `backend/modules/data_entry/routers/ocr.py`
- ✅ `backend/modules/data_entry/schemas/ocr.py`
- ✅ `backend/modules/data_entry/db/models/receipt.py`

### Database Schema Changes
- ✅ Add `needs_manual_review` column to `receipts` table (nullable BOOLEAN)
- ✅ Alembic migration required
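
The reason for a nullable column without a default can be illustrated with the standard-library `sqlite3` module. This is only a demonstration of the semantics; the real change would be an Alembic migration against the production database, and the `amount` column here is a simplified stand-in for the actual `receipts` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (id INTEGER PRIMARY KEY, amount TEXT)")
conn.execute("INSERT INTO receipts (amount) VALUES ('85.99')")  # pre-existing row

# Equivalent of the migration step: nullable column, no default
conn.execute("ALTER TABLE receipts ADD COLUMN needs_manual_review BOOLEAN")

# A receipt uploaded after the migration, flagged by validation
conn.execute("INSERT INTO receipts (amount, needs_manual_review) VALUES ('12.50', 1)")

rows = conn.execute(
    "SELECT id, needs_manual_review FROM receipts ORDER BY id"
).fetchall()
print(rows)
```

The pre-existing row reads back as `NULL` ("never validated"), which stays distinguishable from an explicit `FALSE` ("validated, no review needed") on new uploads.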

---

## Implementation Notes

### Priority Order (Recommended)

1. **Phase 1: Core Validation (Day 1)**
   - Create `ocr/validation.py` module
   - Implement validation rules (amount, TVA, payment, CUI, date)
   - Write unit tests
   - **Checkpoint:** All tests pass

2. **Phase 2: OCR Integration (Day 2 Morning)**
   - Add `preprocess_medium()` to image_preprocessor
   - Update `_merge_extractions()` with validation-aware logic
   - Remove/deprecate `preprocess_heavy()`
   - **Checkpoint:** Five-Holding receipt extracts correctly

3. **Phase 3: API Updates (Day 2 Afternoon)**
   - Update `ExtractionResult` dataclass with validation fields
   - Update API schemas (ocr.py, routers/ocr.py)
   - Add database migration
   - **Checkpoint:** API returns validation warnings

4. **Phase 4: Integration Testing (Day 3 Morning)**
   - Write integration tests
   - Test with real receipts (Five-Holding, OMV, Kaufland)
   - **Checkpoint:** All integration tests pass

5. **Phase 5: Frontend & Polish (Day 3 Afternoon)**
   - Update Vue components to display warnings
   - Add "Needs Review" filter
   - Manual testing
   - Bug fixes
   - **Checkpoint:** Production-ready

### Code Quality Standards

- ✅ Type hints for all functions
- ✅ Docstrings for all public methods
- ✅ Unit test coverage >90%
- ✅ Integration tests for critical paths
- ✅ Print statements for debugging (will be converted to logging later)
- ✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)

### Performance Considerations

- **Validation overhead:** <10ms per receipt (negligible vs. OCR time)
- **Medium preprocessing:** Similar speed to Heavy (~500ms)
- **Database migration:** Non-blocking (adds NULL column)
- **Frontend impact:** Minimal (only displays warnings)

---

## Related Documentation

### Project Context
- **CLAUDE.md:** Data Entry module instructions
- **docs/data-entry/DATA-ENTRY-MODULE.md:** Module architecture
- **docs/ARCHITECTURE-DECISIONS.md:** Ultrathin monolith rationale

### Technical References
- **Romanian CIF validation:** https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
- **OpenCV preprocessing:** https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
- **PaddleOCR docs:** https://github.com/PaddlePaddle/PaddleOCR

### Similar Features
- **Payment methods extraction:** Already implemented in `ocr_extractor.py:1361`
- **TVA entries extraction:** Already implemented in `ocr_extractor.py:820`
- **Cross-validation logic:** Pattern from `_cross_validate_and_calculate_amount` (lines 468-557)

---

## Summary

This specification addresses the critical OCR data extraction bug in the Data Entry module. The multi-layer validation system protects data integrity, while non-blocking warnings leave users in control of the final values.

**Key Benefits:**
- ✅ Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
- ✅ Validates cross-field dependencies (payment sum, TVA sum)
- ✅ Validates extracted CUIs with the Mod 11 checksum
- ✅ Replaces problematic Heavy OCR with Medium preprocessing
- ✅ Non-blocking warnings preserve user workflow
- ✅ Manual review flag helps supervisors prioritize

**Next Steps:**
1. Review and approve specification
2. Create feature branch: `feature/bon-ocr-validation`
3. Implement Phase 1 (validation engine)
4. Continue with Phases 2-5
5. Deploy to staging for testing
6. Monitor production for 1 week before full rollout

---

**Document Version:** 1.0
**Last Updated:** 2025-12-30
**Status:** Ready for Implementation
**Estimated Completion:** 2026-01-02 (3 working days)