# Feature Specification: OCR Data Extraction Validation System

**Feature ID:** bon-ocr-validation
**Priority:** Critical (P0 - Production Bug)
**Complexity:** High
**Estimated Effort:** 2-3 days
**Created:** 2025-12-30
**Module:** Data Entry (backend/modules/data_entry/)

---

## Overview

Fix critical OCR data extraction issue where PaddleOCR Heavy preprocessing (88% confidence) overwrites correct Light OCR (98% confidence) data with garbage values, causing 10,000x magnitude errors in production receipts.

**Value Proposition:** Prevent incorrect financial data from entering the system, reduce manual corrections, improve user trust in OCR accuracy.

---

## Problem Statement

### Current Behavior (BROKEN)

The OCR processing pipeline (`backend/modules/data_entry/services/ocr_service.py`) uses a 3-step adaptive approach:

1. **Step 1:** PaddleOCR + Light preprocessing (fast, high confidence)
2. **Step 2:** PaddleOCR + Heavy preprocessing (faded receipts)
3. **Step 3:** Tesseract (complement missing fields only)

**Critical Bug:** The `_merge_extractions()` method (lines 240-386) blindly prefers higher OCR confidence scores WITHOUT validating actual extracted values.

### Real Production Example (Five-Holding Receipt)

| Field | Light OCR (98%) ✅ | Heavy OCR (88%) ❌ | Final Result ❌ |
|-------|-------------------|-------------------|-----------------|
| **TOTAL** | 85.99 LEI | 859,762.16 LEI | **859,762.16** (10,000x error!) |
| **TVA** | 14.92 LEI | 149,214.92 LEI | **149,214.92** (10,000x error!) |
| **CUI** | R010562600 | (not found) | R010562600 |
| **Date** | 2025-10-11 | 2025-10-11 | 2025-10-11 |
| **Confidence** | 98% | 88% | 88% (wrong source!) |

**Root Cause:** Heavy preprocessing causes digit concatenation on high-quality PDFs. The binarization and morphological operations (lines 153-164 in `image_preprocessor.py`) merge adjacent numbers, creating garbage values.
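To make the size of the error concrete, the sketch below computes the magnitude ratio between the Light and Heavy readings from the table above and applies the 10x inconsistency threshold this spec uses later. `magnitude_ratio` is a hypothetical helper written for this illustration; it is not an existing function in `ocr_service.py`.

```python
from decimal import Decimal


def magnitude_ratio(a: Decimal, b: Decimal) -> Decimal:
    """Return how many times larger the bigger value is (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("amounts must be positive")
    return max(a, b) / min(a, b)


# (Light OCR, Heavy OCR) readings from the production example
readings = {
    "amount": (Decimal("85.99"), Decimal("859762.16")),
    "tva_total": (Decimal("14.92"), Decimal("149214.92")),
}

ratios = {name: magnitude_ratio(light, heavy) for name, (light, heavy) in readings.items()}

# Fields whose two readings differ by more than 10x are treated as suspicious
suspicious = [name for name, ratio in ratios.items() if ratio > 10]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Both fields exceed the threshold by roughly three orders of magnitude, so a ratio check alone would have been enough to distrust the Heavy values in this case.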
### Impact

- **Data Integrity:** Incorrect amounts enter accounting system
- **User Trust:** Users lose confidence in OCR accuracy
- **Manual Work:** Requires manual verification of ALL OCR extractions
- **Financial Risk:** Wrong amounts could be approved without review

---

## User Stories

### 1. As a user uploading a clear PDF receipt

**I want** OCR to extract correct values from the first pass
**So that** I don't have to manually correct obvious errors

**Acceptance Criteria:**
- Light OCR correctly extracts 85.99 LEI (not 859,762.16)
- Heavy OCR is skipped when Light OCR confidence >= 90%
- No 10,000x magnitude errors

### 2. As a user submitting a receipt with warnings

**I want** to be able to save receipts with validation warnings
**So that** I can submit for review even if OCR isn't perfect

**Acceptance Criteria:**
- Save button works with warnings (not blocked)
- Receipt marked with `needs_manual_review=True`
- Warnings displayed clearly in UI

### 3. As a supervisor reviewing receipts

**I want** to see which receipts need manual review
**So that** I can prioritize validation efforts

**Acceptance Criteria:**
- Filter by "Needs Review" flag
- Validation warnings shown in detail view
- Clear indication of which fields are suspicious

### 4. As a system validating cross-field data

**I want** to validate CARD + NUMERAR = TOTAL
**So that** payment methods match the total amount

**Acceptance Criteria:**
- Cross-validation: sum of payment methods = TOTAL (±0.02 RON tolerance)
- If mismatch, flag for review
- Auto-correct TOTAL from payment sum if confidence < 80%

### 5. As a system validating TVA entries

**I want** to validate Σ(TVA entries) = TVA TOTAL
**So that** individual TVA lines match the total TVA

**Acceptance Criteria:**
- Cross-validation: sum of TVA entries = TVA TOTAL (±0.02 RON tolerance)
- TVA rate validation (5-24% of TOTAL)
- If mismatch, flag for review

---

## Functional Requirements

### Core Requirements (Must-Have)

#### 1. Multi-Layer Validation Pipeline

**FR-1.1:** Absolute value sanity checks
- Amount range: 0.01 - 100,000 RON
- Max 2 decimal places
- Date: not in future, not older than 10 years (2015+)
- CUI: 6-10 digits, valid Mod 11 checksum

**FR-1.2:** Cross-field correlation validation
- TVA: 5-24% of TOTAL amount (Romanian rates: 5%, 9%, 11%, 19%, 21%)
- Payment methods: CARD + NUMERAR = TOTAL (±0.02 RON tolerance)
- Inter-OCR consistency: flag if values differ >10x between engines

**FR-1.3:** Auto-correction logic
- If TOTAL is obviously wrong (>10x payment sum), use payment sum
- If TVA > TOTAL, recalculate TOTAL from TVA using reverse formula
- Preserve high-confidence values from Light OCR over low-confidence Heavy OCR

**FR-1.4:** Validation result structure

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ValidationResult:
    is_valid: bool
    warnings: List[ValidationWarning]   # Non-blocking issues
    errors: List[ValidationError]       # Blocking issues (none for now)
    corrected_fields: Dict[str, Any]    # Auto-corrected values
    needs_manual_review: bool           # Flag for supervisor
```

#### 2. Replace Heavy with Medium OCR

**FR-2.1:** Remove `preprocess_heavy()` from the OCR pipeline (the method itself is kept but marked deprecated)
- Current Heavy: aggressive binarization causes digit concatenation
- Reason: Destroys high-quality PDFs while trying to recover faded receipts

**FR-2.2:** Add `preprocess_medium()` method
- Moderate contrast enhancement (CLAHE clipLimit=2.0)
- Light denoising (fastNlMeansDenoising h=6)
- NO binarization, NO morphological operations
- Preserve text boundaries on clear images

**FR-2.3:** Update OCR pipeline
- Step 1: Light preprocessing (unchanged)
- Step 2: **Medium** preprocessing (replaces Heavy)
- Step 3: Tesseract (unchanged)

#### 3. Enhanced CUI Extraction

**FR-3.1:** Romanian CIF validation algorithm
- Implement Mod 11 checksum validation
- Control digit formula: `(sum(digit[i] * weight[i]) * 10) % 11`, with a result of 10 mapped to 0
- Weights: `[7, 5, 3, 2, 1, 7, 5, 3, 2]` (aligned right-to-left: the digits before the control digit are left-padded with zeros to 9 digits)

**FR-3.2:** CUI format normalization
- Always add "RO" prefix if missing
- Remove spaces, dashes, dots
- Validate length: 6-10 digits

**FR-3.3:** Improved regex patterns

```python
# Add OCR-tolerant patterns (current patterns are too strict)
CUI_OCR_TOLERANT_PATTERNS = [
    r'CIF[:\s]*R[O0]?\s*(\d[\d\s]{5,9})',   # Spaces in CUI
    r'C[I1]F[:\s]*(\d[\d\s]{6,10})',        # C1F (I→1 OCR error)
    r'C\.?\s*[I1]\.?\s*F\.?[:\s]*(\d+)',    # C. I. F. (spaced)
]
```

#### 4. User Requirements Integration

**FR-4.1:** Non-blocking validation warnings
- Save button enabled even with warnings
- User can override and submit
- Warnings displayed clearly in UI

**FR-4.2:** Manual review flag
- Database field: `receipts.needs_manual_review` (BOOLEAN)
- Set to `TRUE` if:
  - Any validation warning present
  - Overall confidence < 85%
  - Cross-validation fails

**FR-4.3:** Apply to new uploads only
- No reprocessing of existing receipts
- Validation runs on OCR extraction (POST /api/ocr/extract)
- Migration: add column with default NULL (not FALSE)

### Secondary Requirements (Nice-to-Have)

**FR-S1:** Validation confidence scoring
- Each validation rule contributes to score
- Overall validation confidence: weighted average
- Display in UI alongside OCR confidence

**FR-S2:** Validation rule configurability
- Move hardcoded thresholds to config
- Allow per-company customization
- Admin UI to adjust rules

---

## Technical Requirements

### Files to Create

#### 1. `backend/modules/data_entry/services/ocr/validation.py`

**Purpose:** Validation utilities and rule engine
**Size:** ~400 lines

**Key Classes:**
- `ValidationRule` (base class)
- `AmountRangeRule`, `TVARatioRule`, `PaymentSumRule`, `CUIChecksumRule`
- `OCRValidationEngine` (orchestrator)

**Example:**

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Optional

from backend.modules.data_entry.services.ocr_extractor import ExtractionResult


@dataclass
class ValidationWarning:
    """Non-blocking validation warning."""
    field: str
    rule: str
    message: str
    severity: str  # 'low', 'medium', 'high'
    suggested_value: Optional[Any] = None


class ValidationRule(ABC):
    """Base validation rule."""

    @abstractmethod
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        pass

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Return corrected field values; rules without corrections return {}."""
        return {}


class AmountRangeRule(ValidationRule):
    """Validate amount is in reasonable range (0.01 - 100,000 RON)."""

    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # 'is not None' so that an extracted 0.00 is still checked
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))
        return warnings


class OCRValidationEngine:
    """Orchestrate all validation rules."""

    def __init__(self):
        self.rules = [
            AmountRangeRule(),
            TVARatioRule(),
            PaymentSumRule(),
            TVAEntriesSumRule(),
            InterOCRConsistencyRule(),
            CUIChecksumRule(),
            DateValidityRule(),
        ]

    def validate(self, extraction: ExtractionResult) -> ValidationResult:
        """Run all validation rules and return result."""
        all_warnings = []
        corrected_fields = {}

        for rule in self.rules:
            warnings = rule.validate(extraction)
            all_warnings.extend(warnings)

            # Apply auto-corrections
            corrections = rule.auto_correct(extraction)
            corrected_fields.update(corrections)

        needs_review = (
            len(all_warnings) > 0 or
            extraction.overall_confidence < 0.85
        )

        return ValidationResult(
            is_valid=True,  # Never block (warnings only)
            warnings=all_warnings,
            errors=[],
            corrected_fields=corrected_fields,
            needs_manual_review=needs_review
        )
```

#### 2. `backend/modules/data_entry/tests/test_ocr_validation.py`

**Purpose:** Unit tests for validation rules
**Size:** ~300 lines
**Coverage Target:** >90%

**Test Cases:**
- `test_amount_range_valid()` - 85.99 RON passes
- `test_amount_range_too_high()` - 859,762.16 fails
- `test_tva_ratio_valid()` - 14.92/85.99 = 17.3% passes
- `test_tva_ratio_too_high()` - 149,214.92/859,762.16 = 17.3%, so the ratio itself passes; the wrong amounts must be caught by `amount_range`
- `test_payment_sum_matches()` - CARD 50 + NUMERAR 35.99 = TOTAL 85.99
- `test_cui_checksum_valid()` - R010562600 passes Mod 11
- `test_cui_checksum_invalid()` - R010562601 fails Mod 11
- `test_inter_ocr_consistency()` - 85.99 vs 859,762.16 = 10,000x flag

#### 3. `backend/modules/data_entry/tests/test_ocr_validation_integration.py`

**Purpose:** Integration tests with full OCR pipeline
**Size:** ~200 lines

**Test Cases:**
- `test_five_holding_receipt()` - Real production case (85.99 not 859,762.16)
- `test_clear_pdf_uses_light_ocr()` - High-quality PDF skips Step 2 (formerly Heavy)
- `test_faded_receipt_uses_medium_ocr()` - Thermal receipt uses Medium
- `test_validation_warnings_in_response()` - API returns warnings
- `test_manual_review_flag_set()` - Flag set when confidence < 85%

### Files to Modify

#### 1. `backend/modules/data_entry/services/ocr_service.py`

**Changes:** ~200 lines modified, ~100 lines added

**Key Modifications:**

**A. Replace `_merge_extractions()` (lines 240-386) with validation-aware version:**

```python
def _merge_extractions(
    self,
    light: Optional[ExtractionResult],
    medium: Optional[ExtractionResult]  # Renamed from 'tesseract'
) -> ExtractionResult:
    """
    Merge extractions with VALIDATION-AWARE logic.

    NEW Strategy:
    1. Run validation on both extractions
    2. Prefer extraction with FEWER warnings (not just higher confidence)
    3. For each field, pick value that passes validation
    4. Flag inter-OCR inconsistencies (>10x difference)
    """
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine

    # Nothing to merge when one of the passes produced no result
    if light is None or medium is None:
        return light or medium or ExtractionResult()

    validator = OCRValidationEngine()

    # Validate both extractions
    light_validation = validator.validate(light)
    medium_validation = validator.validate(medium)

    result = ExtractionResult()

    # === AMOUNT (with validation check) ===
    if light.amount and medium.amount:
        # Normal merge: prefer higher confidence
        use_light = light.confidence_amount >= medium.confidence_amount
        reason = "higher confidence"

        # Check for 10x inconsistency and record it for InterOCRConsistencyRule
        ratio = max(light.amount, medium.amount) / min(light.amount, medium.amount)
        result.inter_ocr_ratio = float(ratio)
        if ratio > 10:
            print(f"[Merge] WARNING: Inter-OCR inconsistency: {light.amount} vs {medium.amount} ({ratio:.0f}x)", flush=True)

            # Prefer value that passes validation; on a tie keep the confidence-based choice
            light_warnings = [w for w in light_validation.warnings if w.field == 'amount']
            medium_warnings = [w for w in medium_validation.warnings if w.field == 'amount']
            if len(light_warnings) != len(medium_warnings):
                use_light = len(light_warnings) < len(medium_warnings)
                reason = "fewer warnings"

        source = light if use_light else medium
        result.amount = source.amount
        result.confidence_amount = source.confidence_amount
        result.inter_ocr_source_used = 'light' if use_light else 'medium'
        print(f"[Merge] Using {'Light' if use_light else 'Medium'} OCR amount: {source.amount} ({reason})", flush=True)
    elif light.amount:
        result.amount = light.amount
        result.confidence_amount = light.confidence_amount
    elif medium.amount:
        result.amount = medium.amount
        result.confidence_amount = medium.confidence_amount

    # ... (similar logic for other fields)

    return result
```

**B. 
Add `preprocess_medium()` call (replace Heavy):**

```python
# Line ~130: Replace preprocess_heavy with preprocess_medium
print("=" * 60, flush=True)
print("[OCR] STEP 2: PaddleOCR + Medium preprocessing", flush=True)
print("=" * 60, flush=True)

medium_img = self.preprocessor.preprocess_medium(image)  # NEW
try:
    paddle_medium = self.ocr_engine._paddle_recognize(medium_img)
    # ... rest of processing
```

**C. Add validation to final result:**

```python
# Line ~204: Add validation before returning
if extraction:
    extraction = self._final_validation(extraction)

    # NEW: Run validation engine
    from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
    validator = OCRValidationEngine()
    validation_result = validator.validate(extraction)

    # Apply auto-corrections
    for field, value in validation_result.corrected_fields.items():
        setattr(extraction, field, value)

    # Store validation warnings (add to ExtractionResult)
    extraction.validation_warnings = validation_result.warnings
    extraction.needs_manual_review = validation_result.needs_manual_review
```

#### 2. `backend/modules/data_entry/services/ocr_extractor.py`

**Changes:** ~50 lines modified, ~30 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionResult` (lines 10-50):**

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionResult:
    """Structured extraction result from receipt."""
    # ... existing fields ...

    # NEW: Validation results (ValidationWarning objects from the validation engine)
    validation_warnings: List["ValidationWarning"] = field(default_factory=list)
    needs_manual_review: bool = False  # Flag for supervisor review

    # NEW: Inter-OCR comparison data
    inter_ocr_ratio: Optional[float] = None  # Ratio between Light/Medium values
    inter_ocr_source_used: Optional[str] = None  # 'light' or 'medium'
```

**B. Fix CLIENT CUI patterns (lines 253-272):**

```python
# Current patterns are too strict - add OCR-tolerant versions
CLIENT_CUI_PATTERNS = [
    # ... existing patterns ...
    # NEW: OCR-tolerant patterns
    (r'CLIENT\s+C[I1UO]F\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),  # Spaces in CUI
    (r'C[I1]F\s+CLIENT\s*[:/]?\s*(?:R[O0])?(\d[\d\s]{5,9})', 0.96),    # Reversed format
    (r'CLIENT.*?(?:R[O0])?(\d{6,10})\s*\n', 0.90),                     # CUI on next line
]
```

**C. Add CUI normalization and validation:**

```python
def _normalize_cui(self, cui: Optional[str]) -> Optional[str]:
    """Normalize CUI format and validate checksum."""
    if not cui:
        return None

    # Drop a leading "RO" (or OCR-misread "R0") so its zero is not counted as a digit
    cui = re.sub(r'^\s*R[O0]', '', cui, flags=re.IGNORECASE)

    # Remove non-digits
    digits = re.sub(r'\D', '', cui)

    # Validate length
    if not (6 <= len(digits) <= 10):
        return None

    # Validate Mod 11 checksum (Romanian CIF algorithm)
    if not self._validate_cui_checksum(digits):
        print(f"[CUI Validation] Invalid checksum: {digits}", flush=True)
        return None

    # Add RO prefix
    return f"RO{digits}"

def _validate_cui_checksum(self, digits: str) -> bool:
    """Validate Romanian CIF Mod 11 checksum."""
    if len(digits) < 2:
        return False

    # Weights: 7, 5, 3, 2, 1, 7, 5, 3, 2 (aligned to the right of the digits)
    weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]

    # Get control digit (last digit)
    control = int(digits[-1])

    # Calculate checksum (all digits except last), left-padded with zeros to 9 digits
    digits_to_check = digits[:-1].zfill(9)
    checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))

    # Multiply by 10, then Mod 11; a remainder of 10 means control digit 0
    remainder = (checksum * 10) % 11
    expected_control = 0 if remainder == 10 else remainder

    return control == expected_control
```

#### 3. `backend/modules/data_entry/services/image_preprocessor.py`

**Changes:** ~80 lines added

**Key Modifications:**

**A. Add `preprocess_medium()` method (after line 166):**

```python
def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
    """
    Medium preprocessing for MIXED-QUALITY images.

    Balance between Light (too gentle) and Heavy (too aggressive).
    Use cases:
    - Moderately faded receipts
    - Photos with uneven lighting
    - Scans with slight blur

    Preprocessing steps:
    - Moderate contrast enhancement (CLAHE clipLimit=2.0)
    - Light denoising (fastNlMeansDenoising h=6)
    - Gentle sharpening
    - NO binarization (preserves text boundaries)
    - NO morphological operations (avoids digit concatenation)
    """
    # 0. Add safety padding
    image = self._add_safety_padding(image)

    # 1. Grayscale
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image.copy()

    # 2. Scale (same as Light)
    height, width = gray.shape
    max_side = max(height, width)
    if max_side > 4000:
        scale = 4000 / max_side
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)

    height, width = gray.shape
    if width < 1500:
        # Upscale narrow images, but cap the factor so the longest side stays <= 4000 px
        scale = min(1500 / width, 4000 / max(height, width))
        gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    # 3. Deskew
    gray = self._deskew(gray)

    # 4. Moderate contrast enhancement
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # 5. Light denoising (less aggressive than Heavy)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

    # 6. Gentle sharpening
    gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
    sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

    # NO binarization, NO morphological operations
    # This preserves text boundaries and avoids digit concatenation
    return sharpened
```

**B. Mark `preprocess_heavy()` as deprecated:**

```python
def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
    """
    Heavy preprocessing for FADED thermal receipts.

    ⚠️ DEPRECATED: Use preprocess_medium() instead.
    Heavy preprocessing causes digit concatenation on clear PDFs.
    Kept for backward compatibility only.
    """
    # ... existing code (unchanged)
```

#### 4. `backend/modules/data_entry/routers/ocr.py`

**Changes:** ~40 lines modified

**Key Modifications:**

**A. Update `ExtractionData` schema instantiation (lines 106-128):**

```python
# Add validation warnings to response (empty list if the result has none)
validation_warnings_list = [
    {
        'field': w.field,
        'rule': w.rule,
        'message': w.message,
        'severity': w.severity,
        'suggested_value': w.suggested_value
    }
    for w in getattr(result, 'validation_warnings', [])
]

data = ExtractionData(
    # ... existing fields ...

    # NEW: Validation fields
    validation_warnings=validation_warnings_list,
    needs_manual_review=getattr(result, 'needs_manual_review', False),
    inter_ocr_ratio=getattr(result, 'inter_ocr_ratio', None),
    inter_ocr_source_used=getattr(result, 'inter_ocr_source_used', None),
)
```

#### 5. `backend/modules/data_entry/schemas/ocr.py`

**Changes:** ~20 lines added

**Key Modifications:**

**A. Add validation fields to `ExtractionData` (after line 57):**

```python
class ValidationWarning(BaseModel):
    """Validation warning from OCR extraction."""
    field: str = Field(description="Field name (e.g., 'amount', 'tva_total')")
    rule: str = Field(description="Rule name (e.g., 'amount_range', 'tva_ratio')")
    message: str = Field(description="Human-readable warning message")
    severity: str = Field(description="Severity: 'low', 'medium', 'high'")
    suggested_value: Optional[Any] = Field(default=None, description="Suggested corrected value")


class ExtractionData(BaseModel):
    """Extracted receipt data from OCR."""
    # ... existing fields ...

    # NEW: Validation results
    validation_warnings: List[ValidationWarning] = Field(default_factory=list, description="Validation warnings")
    needs_manual_review: bool = Field(default=False, description="Flag for supervisor review")
    inter_ocr_ratio: Optional[float] = Field(default=None, description="Ratio between OCR engines (>10 = inconsistent)")
    inter_ocr_source_used: Optional[str] = Field(default=None, description="OCR engine used: 'light' or 'medium'")
```

#### 6. Database Migration: `backend/modules/data_entry/migrations/versions/XXX_add_needs_manual_review.py`

**Purpose:** Add `needs_manual_review` column to `receipts` table
**Size:** ~30 lines (Alembic migration)

```python
"""Add needs_manual_review flag to receipts

Revision ID: XXX
Create Date: 2025-12-30
"""
from alembic import op
import sqlalchemy as sa

revision = 'XXX'
down_revision = 'YYY'  # Previous migration
branch_labels = None
depends_on = None


def upgrade():
    # Add column with default NULL (not FALSE)
    # NULL = not validated yet (old receipts)
    # FALSE = validated, no review needed
    # TRUE = validated, needs review
    op.add_column('receipts', sa.Column('needs_manual_review', sa.Boolean(), nullable=True))


def downgrade():
    op.drop_column('receipts', 'needs_manual_review')
```

### Frontend Integration Points

#### 1. `src/modules/data-entry/views/receipts/ReceiptCreateView.vue`

**Changes:** Display validation warnings below OCR results

**Example:**

```vue
```

#### 2. `src/modules/data-entry/components/ocr/OCRPreview.vue`

**Changes:** Add inter-OCR consistency indicator

**Example:**

```vue
```

---

## Design Decisions

### 1. Why Validation Warnings Instead of Errors?

**Decision:** Use non-blocking warnings instead of blocking errors.

**Rationale:**
- User requirement: "Allow save with warnings"
- OCR will never be 100% perfect
- Users can override incorrect extractions
- Supervisor review catches issues before approval

**Trade-off:** Risk of bad data entering system vs. user frustration with blocked submissions.

**Mitigation:** Manual review flag ensures supervisor catches issues.

### 2. Why Replace Heavy with Medium OCR?

**Decision:** Remove Heavy preprocessing, add Medium preprocessing.
**Rationale:**
- **Heavy causes digit concatenation** on clear PDFs (production evidence)
- Binarization destroys text boundaries on high-quality images
- Morphological operations merge adjacent numbers (85.99 → 859,762.16)

**Analysis of Heavy Preprocessing (lines 153-164 in `image_preprocessor.py`):**

```python
# 7. Adaptive thresholding (binarization) - PROBLEM!
binary = cv2.adaptiveThreshold(
    sharpened, 255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY,
    blockSize=11, C=5  # Block size can merge nearby digits
)

# 8. Morphological operations - COMPOUNDS THE PROBLEM!
kernel_close = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel_close)
# MORPH_CLOSE fills small gaps → merges adjacent numbers
```

**Alternative Considered:** Keep Heavy but add safeguards.
**Rejected:** Too risky, no benefit for clear PDFs.

### 3. Why Romanian CIF Mod 11 Validation?

**Decision:** Implement CIF checksum validation algorithm.

**Rationale:**
- Romanian CIFs have built-in checksum (last digit)
- Validates extracted CUI is mathematically correct
- Catches OCR digit errors (10562600 vs 10562601)

**Algorithm:** Mod 11 checksum
- Weights: [7, 5, 3, 2, 1, 7, 5, 3, 2], aligned to the right of the digits that precede the control digit (left-pad those digits with zeros to 9)
- Formula: `(sum(digit[i] * weight[i]) * 10) % 11`
- Control digit: the result (0 if the result is 10)

**Example:** RO10562600
- Digits before the control digit, left-padded to 9: 0,0,1,0,5,6,2,6,0; control digit: [0]
- Weighted sum: 0×7 + 0×5 + 1×3 + 0×2 + 5×1 + 6×7 + 2×5 + 6×3 + 0×2 = 0+0+3+0+5+42+10+18+0 = 78
- (78 × 10) % 11 = 780 % 11 = 10 → control digit 0, which matches the last digit → **VALID**

**Note:** Some older CIFs may not have checksums (pre-2000). Validation is permissive (warning, not error).

### 4. Why Apply to New Uploads Only?

**Decision:** Don't reprocess existing receipts.
**Rationale:**
- Migration impact: ~500 existing receipts in DB
- Reprocessing cost: OCR is slow (~2-5s per receipt)
- Risk: May change existing approved data
- Benefit: Minimal (old receipts already reviewed)

**Implementation:** Migration adds column with default NULL (not FALSE).

---

## Validation Rules Specification

### 1. Amount Range Validation

**Rule:** Amount must be between 0.01 and 100,000 RON.

**Implementation:**

```python
class AmountRangeRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        # 'is not None' so that an extracted 0.00 is still reported as too small
        if extraction.amount is not None:
            if extraction.amount < Decimal('0.01'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} is too small (< 0.01 RON)',
                    severity='high'
                ))
            elif extraction.amount > Decimal('100000'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='amount_range',
                    message=f'Amount {extraction.amount} exceeds limit (> 100,000 RON)',
                    severity='high'
                ))

            # Check decimal places (a non-negative exponent means no fractional digits)
            decimal_places = max(0, -extraction.amount.as_tuple().exponent)
            if decimal_places > 2:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='decimal_places',
                    message=f'Amount has {decimal_places} decimal places (max 2)',
                    severity='medium',
                    suggested_value=extraction.amount.quantize(Decimal('0.01'))
                ))
        return warnings
```

**Test Cases:**
- 0.00 RON → Warning (too small)
- 0.01 RON → Valid
- 85.99 RON → Valid
- 100,000 RON → Valid
- 100,001 RON → Warning (too large)
- 859,762.16 RON → Warning (too large)
- 85.999 RON → Warning (too many decimals)

### 2. TVA Ratio Validation

**Rule:** TVA must be 5-24% of TOTAL amount.
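As a rough cross-check of the 5-24% band, the relation below links a single TVA rate $r$ to the share of TOTAL that the TVA represents. It assumes TOTAL is the gross, TVA-inclusive amount printed on the receipt and that the whole receipt uses one rate; the spec does not state either point explicitly, so treat this as a sketch.

```latex
\frac{\mathrm{TVA}}{\mathrm{TOTAL}} = \frac{r}{1 + r}
\qquad
\begin{array}{c|ccccc}
r               & 5\%    & 9\%    & 11\%   & 19\%    & 21\%    \\ \hline
\frac{r}{1 + r} & 4.76\% & 8.26\% & 9.91\% & 15.97\% & 17.36\%
\end{array}
```

Under that assumption the production receipt (14.92 / 85.99 ≈ 17.35%) is consistent with the 21% rate. A receipt taxed entirely at 5% would come out near 4.76%, slightly below the 5% lower bound, so it may be worth confirming whether the lower bound should leave room for that case.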
**Implementation:**

```python
class TVARatioRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_total and extraction.amount:
            # TVA cannot be greater than TOTAL
            if extraction.tva_total > extraction.amount:
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_greater_than_total',
                    message=f'TVA ({extraction.tva_total}) cannot be greater than TOTAL ({extraction.amount})',
                    severity='high',
                    suggested_value=None  # Will be auto-corrected by service
                ))
            else:
                # Check ratio
                ratio = extraction.tva_total / extraction.amount * Decimal('100')
                if ratio < Decimal('5'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_low',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='medium'
                    ))
                elif ratio > Decimal('24'):
                    warnings.append(ValidationWarning(
                        field='tva_total',
                        rule='tva_ratio_high',
                        message=f'TVA is {ratio:.1f}% of total (expected 5-24%)',
                        severity='high'
                    ))
        return warnings
```

**Test Cases:**
- TVA=14.92, TOTAL=85.99 → 17.3% → Valid
- TVA=149,214.92, TOTAL=859,762.16 → 17.3% → Both values wrong (caught by amount_range)
- TVA=4.00, TOTAL=100.00 → 4% → Warning (too low)
- TVA=100.00, TOTAL=85.99 → 116% → Warning (impossible!)

### 3. Payment Sum Validation

**Rule:** CARD + NUMERAR must equal TOTAL (±0.02 RON tolerance).
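The ±0.02 boundary in this rule is inclusive, so a difference of exactly 0.02 must still pass. The sketch below shows that comparison with `Decimal` values, which keep 0.02 exact; binary floats give no such guarantee at the boundary. `payment_sum_matches` and `TOLERANCE` are hypothetical names used only for this illustration.

```python
from decimal import Decimal
from typing import Iterable

TOLERANCE = Decimal("0.02")  # ±0.02 RON, taken from the rule above


def payment_sum_matches(
    payments: Iterable[Decimal],
    total: Decimal,
    tolerance: Decimal = TOLERANCE,
) -> bool:
    """True when the payments add up to total within the tolerance (inclusive)."""
    # Start the sum from Decimal zero so an empty list still yields a Decimal
    payment_sum = sum(payments, Decimal("0"))
    return abs(payment_sum - total) <= tolerance


# Difference of exactly 0.02 RON: still accepted
on_boundary = payment_sum_matches([Decimal("50"), Decimal("35.97")], Decimal("85.99"))

# Difference of 0.99 RON: rejected
mismatch = payment_sum_matches([Decimal("50"), Decimal("35.00")], Decimal("85.99"))
```

For this to hold end to end, the extracted amounts would need to be parsed from the OCR text straight into `Decimal` (for example `Decimal("35.97")`), not routed through `float` first.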
**Implementation:**

```python
class PaymentSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='payment_sum_mismatch',
                    message=f'Payment methods sum ({payment_sum}) ≠ TOTAL ({extraction.amount}), diff={difference}',
                    severity='high',
                    suggested_value=payment_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Auto-correct TOTAL from payment sum if confidence < 80%."""
        corrections = {}
        if extraction.payment_methods and extraction.amount:
            payment_sum = sum(pm['amount'] for pm in extraction.payment_methods)
            difference = abs(payment_sum - extraction.amount)
            if difference > Decimal('0.02') and extraction.confidence_amount < 0.80:
                corrections['amount'] = payment_sum
                print(f"[Auto-Correct] TOTAL corrected: {extraction.amount} → {payment_sum} (from payment methods)", flush=True)
        return corrections
```

**Test Cases:**
- CARD=50, NUMERAR=35.99, TOTAL=85.99 → Valid
- CARD=50, NUMERAR=35.97, TOTAL=85.99 → Diff=0.02 → Valid (tolerance)
- CARD=50, NUMERAR=35.00, TOTAL=85.99 → Diff=0.99 → Warning

### 4. TVA Entries Sum Validation

**Rule:** Σ(TVA entries) must equal TVA TOTAL (±0.02 RON tolerance).
**Implementation:**

```python
class TVAEntriesSumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                warnings.append(ValidationWarning(
                    field='tva_total',
                    rule='tva_entries_sum_mismatch',
                    message=f'TVA entries sum ({entries_sum}) ≠ TVA TOTAL ({extraction.tva_total}), diff={difference}',
                    severity='medium',
                    suggested_value=entries_sum
                ))
        return warnings

    def auto_correct(self, extraction: ExtractionResult) -> Dict[str, Any]:
        """Use entries sum as TVA TOTAL if mismatch."""
        corrections = {}
        if extraction.tva_entries and extraction.tva_total:
            entries_sum = sum(e['amount'] for e in extraction.tva_entries)
            difference = abs(entries_sum - extraction.tva_total)
            if difference > Decimal('0.02'):
                corrections['tva_total'] = entries_sum
                print(f"[Auto-Correct] TVA TOTAL corrected: {extraction.tva_total} → {entries_sum} (from entries)", flush=True)
        return corrections
```

**Test Cases:**
- Entries=[A:19%:14.92], TOTAL=14.92 → Valid
- Entries=[A:19%:10.00, B:9%:4.92], TOTAL=14.92 → Valid
- Entries=[A:19%:14.92], TOTAL=14.94 → Diff=0.02 → Valid (tolerance)
- Entries=[A:19%:14.92], TOTAL=15.00 → Diff=0.08 → Warning

### 5. Inter-OCR Consistency Validation

**Rule:** Flag if values differ >10x between OCR engines.
**Implementation:**

```python
class InterOCRConsistencyRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        """This rule is applied during merge, stores ratio in extraction."""
        warnings = []
        if hasattr(extraction, 'inter_ocr_ratio') and extraction.inter_ocr_ratio:
            if extraction.inter_ocr_ratio > 10:
                warnings.append(ValidationWarning(
                    field='amount',
                    rule='inter_ocr_inconsistency',
                    message=f'Large inconsistency between OCR engines ({extraction.inter_ocr_ratio:.0f}x difference)',
                    severity='high'
                ))
        return warnings
```

**Test Cases:**
- Light=85.99, Medium=86.00 → Ratio=1.00 → Valid
- Light=85.99, Medium=90.00 → Ratio=1.05 → Valid
- Light=85.99, Medium=859.76 → Ratio=10.00 → Valid (edge case)
- Light=85.99, Medium=859,762.16 → Ratio=10,000 → Warning!

### 6. CUI Checksum Validation

**Rule:** Validate Romanian CIF Mod 11 checksum.

**Implementation:**

```python
import re


class CUIChecksumRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.cui:
            # Normalize CUI
            digits = re.sub(r'\D', '', extraction.cui)

            # Validate length
            if not (6 <= len(digits) <= 10):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_length',
                    message=f'CUI length invalid: {len(digits)} digits (expected 6-10)',
                    severity='medium'
                ))
                return warnings

            # Validate Mod 11 checksum
            if not self._validate_checksum(digits):
                warnings.append(ValidationWarning(
                    field='cui',
                    rule='cui_checksum',
                    message=f'CUI checksum invalid: {extraction.cui} (failed Mod 11 validation)',
                    severity='medium'  # Medium: some old CIFs don't have checksums
                ))
        return warnings

    def _validate_checksum(self, digits: str) -> bool:
        """Romanian CIF Mod 11 checksum validation."""
        if len(digits) < 2:
            return False
        weights = [7, 5, 3, 2, 1, 7, 5, 3, 2]
        control = int(digits[-1])
        # Left-pad to 9 digits so the weights align with the rightmost digits
        digits_to_check = digits[:-1].zfill(9)
        checksum = sum(int(d) * w for d, w in zip(digits_to_check, weights))
        # Multiply by 10 before taking Mod 11; a remainder of 10 means control digit 0
        remainder = (checksum * 10) % 11
        expected = 0 if remainder == 10 else remainder
        return control == expected
```

**Test Cases:**
- R010562600 → Valid (checksum passes)
- R011201891 → Valid (checksum passes)
- R012345678 → Warning (invalid checksum)
- R01234 → Warning (too short)

### 7. Date Validity Validation

**Rule:** Date must not be in future, not older than 10 years.

**Implementation:**

```python
from datetime import date


class DateValidityRule(ValidationRule):
    def validate(self, extraction: ExtractionResult) -> List[ValidationWarning]:
        warnings = []
        if extraction.receipt_date:
            today = date.today()

            # Check future date
            if extraction.receipt_date > today:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_future',
                    message=f'Date is in the future: {extraction.receipt_date}',
                    severity='high'
                ))

            # Check too old (10 years)
            try:
                cutoff_date = today.replace(year=today.year - 10)
            except ValueError:
                # Today is Feb 29 and the year 10 years back is not a leap year
                cutoff_date = today.replace(year=today.year - 10, day=28)
            if extraction.receipt_date < cutoff_date:
                warnings.append(ValidationWarning(
                    field='receipt_date',
                    rule='date_too_old',
                    message=f'Date is older than 10 years: {extraction.receipt_date}',
                    severity='medium'
                ))
        return warnings
```

**Test Cases:**
- 2025-12-30 (today) → Valid
- 2025-10-11 → Valid
- 2026-01-01 → Warning (future)
- 2015-12-31 → Valid (just inside the 10-year window)
- 2014-12-31 → Warning (too old)

---

## Acceptance Criteria

### Critical Success Criteria (Must Pass)

✅ **AC-1:** Five-Holding receipt extracts correct values
- **Given:** Production PDF receipt (Five-Holding, 85.99 LEI)
- **When:** OCR processes with new validation
- **Then:**
  - TOTAL = 85.99 LEI (NOT 859,762.16)
  - TVA = 14.92 LEI (NOT 149,214.92)
  - CUI = R010562600
  - Overall confidence >= 90%

✅ **AC-2:** Save works with validation warnings
- **Given:** Receipt with low confidence (75%)
- **When:** User clicks Save
- **Then:**
  - Warnings displayed in UI
  - Save button enabled
  - Receipt saved with `needs_manual_review=TRUE`

✅ **AC-3:** Cross-validation: CARD + NUMERAR = TOTAL
- **Given:** Receipt with CARD=50, NUMERAR=35.99
- **When:** OCR extracts TOTAL=85.94 (off by 0.05, beyond the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "Payment sum (85.99) ≠ TOTAL (85.94)"
  - Suggested value: 85.99
  - Auto-corrected if confidence < 80%

✅ **AC-4:** Cross-validation: Σ(TVA entries) = TVA TOTAL
- **Given:** Receipt with TVA A=10.00, TVA B=4.92
- **When:** OCR extracts TVA TOTAL=14.87 (off by 0.05, beyond the ±0.02 tolerance)
- **Then:**
  - Warning displayed: "TVA entries sum (14.92) ≠ TVA TOTAL (14.87)"
  - Auto-corrected to 14.92

✅ **AC-5:** CUI Mod 11 validation works
- **Given:** Receipt with CUI R010562600
- **When:** OCR processes
- **Then:**
  - CUI validated against Mod 11 checksum
  - If invalid, warning displayed
  - Format normalized to "RO" prefix

### Secondary Criteria (Nice-to-Have)

🔲 **AC-S1:** Medium OCR performs better than Heavy
- **Given:** 10 clear PDF receipts
- **When:** Processed with Light → Medium → Tesseract
- **Then:**
  - No 10x magnitude errors
  - Average confidence >= 90%
  - Processing time < 5s

🔲 **AC-S2:** Validation warnings show in UI
- **Given:** Receipt with 3 validation warnings
- **When:** OCR completes
- **Then:**
  - Warning section displayed
  - Each warning shows: field, message, severity
  - Suggested values displayed if available

---

## Testing Strategy

### Unit Tests (~300 lines)

**File:** `backend/modules/data_entry/tests/test_ocr_validation.py`

**Test Coverage:**

```python
# Amount validation
test_amount_range_valid()
test_amount_range_too_small()
test_amount_range_too_large()
test_amount_decimal_places()

# TVA validation
test_tva_ratio_valid()
test_tva_ratio_too_low()
test_tva_ratio_too_high()
test_tva_greater_than_total()
test_tva_entries_sum_matches()
test_tva_entries_sum_mismatch()

# Payment validation
test_payment_sum_matches()
test_payment_sum_mismatch_within_tolerance()
test_payment_sum_mismatch_auto_corrected()

# CUI validation
test_cui_checksum_valid()
test_cui_checksum_invalid()
test_cui_length_invalid()
test_cui_normalization()

# Date validation
test_date_valid()
test_date_future()
test_date_too_old()

# Inter-OCR consistency
test_inter_ocr_consistency_valid()
test_inter_ocr_consistency_10x_difference()

# 
Validation engine test_validation_engine_no_warnings() test_validation_engine_multiple_warnings() test_validation_engine_auto_corrections() test_needs_manual_review_flag() ``` ### Integration Tests (~200 lines) **File:** `backend/modules/data_entry/tests/test_ocr_validation_integration.py` **Test Coverage:** ```python # Real receipts test_five_holding_receipt() # Production case (85.99 not 859,762.16) test_omv_receipt() # Clear PDF, Light OCR only test_kaufland_receipt() # Faded thermal, Medium OCR test_mega_image_receipt() # Multiple TVA entries # OCR pipeline test_light_ocr_high_confidence_skips_medium() test_light_ocr_low_confidence_runs_medium() test_medium_ocr_replaces_heavy() test_validation_runs_after_merge() # API responses test_api_returns_validation_warnings() test_api_returns_needs_manual_review_flag() test_api_returns_inter_ocr_ratio() test_api_auto_corrects_amount_from_payments() # Edge cases test_no_ocr_engines_available() test_pdf_with_multiple_pages() test_receipt_with_no_tva() test_receipt_with_no_payment_methods() ``` ### Manual Testing Checklist 1. **Upload Five-Holding receipt PDF** (production case) - [ ] Verify TOTAL = 85.99 (not 859,762.16) - [ ] Verify TVA = 14.92 (not 149,214.92) - [ ] Verify no validation warnings - [ ] Verify overall confidence >= 90% 2. **Upload faded thermal receipt photo** - [ ] Verify Medium OCR used (not Heavy) - [ ] Verify readable text extracted - [ ] Verify no digit concatenation 3. **Upload receipt with payment methods** - [ ] Verify CARD + NUMERAR displayed - [ ] Verify sum matches TOTAL - [ ] If mismatch, verify warning displayed 4. **Upload receipt with multiple TVA entries** - [ ] Verify all TVA entries extracted - [ ] Verify sum matches TVA TOTAL - [ ] If mismatch, verify warning displayed 5. **Submit receipt with warnings** - [ ] Verify Save button enabled - [ ] Verify warnings displayed in UI - [ ] Verify `needs_manual_review` flag set 6. 
**Filter receipts by "Needs Review"** - [ ] Verify filter shows flagged receipts - [ ] Verify supervisor can review --- ## Risks and Mitigations | Risk | Likelihood | Impact | Mitigation | |------|------------|--------|------------| | **Medium OCR still causes errors** | Medium | High | Keep Tesseract as Step 3 fallback; validation catches issues | | **CUI Mod 11 validation too strict** | Medium | Low | Use warning (not error); allow override; some old CIFs don't have checksums | | **Validation rules too permissive** | Low | Medium | Start conservative, tune based on production data | | **Validation rules too strict** | Medium | Low | Non-blocking warnings allow user override | | **Performance impact** | Low | Low | Validation is fast (<10ms); OCR dominates processing time | | **Breaking changes to API** | Low | High | Add new fields, keep existing fields unchanged; frontend optional | | **Database migration issues** | Low | Medium | Use NULL default (not FALSE); test on staging first | --- ## Out of Scope **Explicitly NOT included in this feature:** 1. ❌ **Reprocessing existing receipts** - Only new uploads validated 2. ❌ **Machine learning OCR improvements** - Use existing PaddleOCR/Tesseract 3. ❌ **Custom OCR training** - Generic models only 4. ❌ **Approval workflow changes** - Validation is separate from approval 5. ❌ **Automatic approval** - Always requires supervisor review 6. ❌ **Advanced validation rules** - Only basic sanity checks 7. ❌ **Multi-currency support** - RON only for now 8. ❌ **Historical receipt validation** - Phase 2 feature 9. ❌ **OCR confidence tuning** - Accept engine defaults 10. ❌ **Frontend validation logic** - Backend only (frontend displays) --- ## Open Questions ### Q1: Should we keep Heavy preprocessing as fallback? **Answer:** No. Remove completely. Evidence shows it causes more harm than good on clear PDFs. Medium preprocessing handles mixed-quality images better. ### Q2: What tolerance for payment sum validation? 
**Answer:** ±0.02 RON (2 bani). Romanian receipts use 2 decimal places, so this tolerance absorbs rounding errors.

### Q3: Should CUI validation be blocking or warning?
**Answer:** Warning only. Some old Romanian CIFs (pre-2000) don't have Mod 11 checksums. Also, OCR may extract digits incorrectly.

### Q4: What if Light OCR has high confidence but wrong values?
**Answer:** Validation catches this. If Light OCR extracts 859,762.16 with 98% confidence, the `amount_range` rule flags it (>100,000 limit) and the user sees a warning.

### Q5: Should we reprocess existing receipts with new validation?
**Answer:** No. It is too risky and time-consuming. Apply validation to new uploads only. A user who wants to re-validate an old receipt can re-upload it.

### Q6: What about receipts with no payment methods?
**Answer:** No validation warning. Not all receipts show a CARD/NUMERAR breakdown (especially older thermal receipts). Validate only if payment methods are extracted.

### Q7: Should validation auto-correct or just warn?
**Answer:** Both. Auto-correct obvious errors (TOTAL from payment sum if confidence < 80%). Warn for ambiguous cases. Never silently change high-confidence values.

### Q8: How to handle receipts from future (clock skew)?
**Answer:** Warning only (not error). Allow up to 1 day in the future (24h tolerance) for clock skew. Beyond that, warn the user.

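
The clock-skew tolerance from Q8 could be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `FUTURE_TOLERANCE`, `check_future_date`, and the injectable `today` parameter are hypothetical names introduced here. Only the 1-day allowance and the warning-only behaviour come from the answer above.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical constant: Q8 allows receipts dated up to 1 day ahead (clock skew).
FUTURE_TOLERANCE = timedelta(days=1)


def check_future_date(receipt_date: date, today: Optional[date] = None) -> Optional[str]:
    """Return a warning message if the date exceeds the tolerance, else None.

    A future date never raises: per Q8 it is a warning, not an error.
    `today` is injectable so the check can be tested deterministically.
    """
    if today is None:
        today = date.today()
    if receipt_date > today + FUTURE_TOLERANCE:
        return f"Date is in the future: {receipt_date.isoformat()}"
    return None


if __name__ == "__main__":
    reference_day = date(2025, 12, 30)
    print(check_future_date(date(2025, 12, 31), reference_day))  # within tolerance
    print(check_future_date(date(2026, 1, 1), reference_day))    # beyond tolerance
```

Passing `today` explicitly keeps the check independent of the wall clock, which makes boundary cases (exactly one day ahead versus two days ahead) straightforward to cover in unit tests.
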

---

## Estimated Complexity

**Overall:** High

**Justification:**
- **File Count:** 6 modified, 3 created, 1 migration = 10 files
- **Line Changes:** ~1,135 lines (400 new validation, 300 tests, 200 integration tests, 235 modifications)
- **Risk Level:** Medium (core OCR pipeline changes, but validation is additive)
- **Testing:** 42 new test cases (26 unit, 16 integration), manual testing required
- **Dependencies:** None (uses existing OCR engines)
- **Complexity Factors:**
  - Multi-layer validation logic
  - Romanian CIF checksum algorithm
  - Cross-field validation dependencies
  - Inter-OCR comparison logic
  - Auto-correction logic
  - Frontend integration
  - Database migration

**Estimated Effort:** 2-3 days
- Day 1: Validation engine + unit tests
- Day 2: OCR pipeline integration + medium preprocessing
- Day 3: Frontend integration + manual testing + bug fixes

---

## Dependencies

### External Libraries
- ✅ `cv2` (OpenCV) - Already installed
- ✅ `numpy` - Already installed
- ✅ `paddleocr` - Already installed
- ✅ `tesseract` - Already installed
- ✅ `pydantic` - Already installed
- ✅ `sqlalchemy` - Already installed

### Internal Modules
- ✅ `backend/modules/data_entry/services/ocr_service.py`
- ✅ `backend/modules/data_entry/services/ocr_extractor.py`
- ✅ `backend/modules/data_entry/services/image_preprocessor.py`
- ✅ `backend/modules/data_entry/routers/ocr.py`
- ✅ `backend/modules/data_entry/schemas/ocr.py`
- ✅ `backend/modules/data_entry/db/models/receipt.py`

### Database Schema Changes
- ✅ Add `needs_manual_review` column to `receipts` table (nullable BOOLEAN)
- ✅ Alembic migration required

---

## Implementation Notes

### Priority Order (Recommended)

1. **Phase 1: Core Validation (Day 1)**
   - Create `ocr/validation.py` module
   - Implement validation rules (amount, TVA, payment, CUI, date)
   - Write unit tests
   - **Checkpoint:** All tests pass
2. **Phase 2: OCR Integration (Day 2 Morning)**
   - Add `preprocess_medium()` to `image_preprocessor.py`
   - Update `_merge_extractions()` with validation-aware logic
   - Remove/deprecate `preprocess_heavy()`
   - **Checkpoint:** Five-Holding receipt extracts correctly
3. **Phase 3: API Updates (Day 2 Afternoon)**
   - Update `ExtractionResult` dataclass with validation fields
   - Update API schemas (`schemas/ocr.py`, `routers/ocr.py`)
   - Add database migration
   - **Checkpoint:** API returns validation warnings
4. **Phase 4: Integration Testing (Day 3 Morning)**
   - Write integration tests
   - Test with real receipts (Five-Holding, OMV, Kaufland)
   - **Checkpoint:** All integration tests pass
5. **Phase 5: Frontend & Polish (Day 3 Afternoon)**
   - Update Vue components to display warnings
   - Add "Needs Review" filter
   - Manual testing
   - Bug fixes
   - **Checkpoint:** Production-ready

### Code Quality Standards
- ✅ Type hints for all functions
- ✅ Docstrings for all public methods
- ✅ Unit test coverage >90%
- ✅ Integration tests for critical paths
- ✅ Print statements for debugging (will be converted to logging later)
- ✅ Follow existing code patterns (SQLModel, Pydantic v2, FastAPI)

### Performance Considerations
- **Validation overhead:** <10ms per receipt (negligible vs. OCR time)
- **Medium preprocessing:** Similar speed to Heavy (~500ms)
- **Database migration:** Non-blocking (adds NULL column)
- **Frontend impact:** Minimal (only displays warnings)

---

## Related Documentation

### Project Context
- **CLAUDE.md:** Data Entry module instructions
- **docs/data-entry/DATA-ENTRY-MODULE.md:** Module architecture
- **docs/ARCHITECTURE-DECISIONS.md:** Ultrathin monolith rationale

### Technical References
- **Romanian CIF validation:** https://ro.wikipedia.org/wiki/Cod_de_identificare_fiscal%C4%83
- **OpenCV preprocessing:** https://docs.opencv.org/4.x/d7/d4d/tutorial_py_thresholding.html
- **PaddleOCR docs:** https://github.com/PaddlePaddle/PaddleOCR

### Similar Features
- **Payment methods extraction:** Already implemented in `ocr_extractor.py:1361`
- **TVA entries extraction:** Already implemented in `ocr_extractor.py:820`
- **Cross-validation logic:** Pattern from `_cross_validate_and_calculate_amount` (lines 468-557)

---

## Summary

This specification describes a fix for the critical OCR data extraction bug in the Data Entry module. A multi-layer validation system protects data integrity, while non-blocking warnings keep the user workflow flexible.

**Key Benefits:**
- ✅ Prevents 10,000x magnitude errors (85.99 vs 859,762.16)
- ✅ Validates cross-field dependencies (payment sum, TVA sum)
- ✅ Improves CUI extraction with Mod 11 checksum
- ✅ Replaces problematic Heavy OCR with Medium preprocessing
- ✅ Non-blocking warnings preserve user workflow
- ✅ Manual review flag helps supervisors prioritize

**Next Steps:**
1. Review and approve specification
2. Create feature branch: `feature/bon-ocr-validation`
3. Implement Phase 1 (validation engine)
4. Continue with Phases 2-5
5. Deploy to staging for testing
6. Monitor production for 1 week before full rollout

---

**Document Version:** 1.0
**Last Updated:** 2025-12-30
**Status:** Ready for Implementation
**Estimated Completion:** 2026-01-02 (3 working days)