docs

2025-12-13 01:56:03 +02:00
parent 9f06482681
commit 682a4b64b9
7 changed files with 2824 additions and 0 deletions
--- a/data-entry-app/docs/IMPLEMENTATION_PLAN_OCR.md
+++ b/data-entry-app/docs/IMPLEMENTATION_PLAN_OCR.md
@@ -0,0 +1,254 @@
+# Plan: Smart OCR with Early Exit
+
+> **Context Handover Document** - Implementation plan for the next session
+
+## Objective
+Optimize the OCR process: if PaddleOCR on light preprocessing gives good results, do NOT run heavy preprocessing or Tesseract.
+
+---
+
+## Early Exit Criteria (ALL must be met)
+
+**Continue with further attempts IF:**
+- Confidence < **85%** OR
+- ANY of the critical fields is missing:
+  - ✗ Receipt number (`receipt_number`)
+  - ✗ Date (`receipt_date`)
+  - ✗ Total amount (`amount`)
+  - ✗ VAT amount, i.e. TVA (`tva_total` or `tva_entries`)
+  - ✗ Tax ID, i.e. CUI (`cui`)
+
+**Early exit ONLY when:**
+- Confidence >= 85% **AND**
+- ALL 5 fields are extracted
+
+---
+
+## Proposed Flow: Adaptive OCR Pipeline
+
+```
+1. PaddleOCR + Light Preprocessing (fastest, best for clear PDFs)
+   ↓
+   Check: conf >= 85% AND all 5 fields extracted?
+   ├─ YES → STOP, return result
+   └─ NO → continue to step 2
+
+2. PaddleOCR + Heavy Preprocessing (for faded thermal receipts)
+   ↓
+   Merge with the previous result
+   Check: all fields extracted now (same criteria as step 1)?
+   ├─ YES → STOP
+   └─ NO → continue to step 3
+
+3. Tesseract + Light (fallback for difficult cases)
+   ↓
+   Merge all results
+   Return the best merged result
+```
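Steps 2 and 3 above depend on a merge step, which the plan calls as `_merge_extractions()` but does not define. The following is a minimal sketch of one possible merge policy, not the project's actual implementation. It assumes a stand-in `ExtractionResult` dataclass with only the fields named in this plan, and it assumes that the result with strictly higher `overall_confidence` wins conflicts, with ties going to the earlier result.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractionResult:
    """Hypothetical stand-in: only the fields this plan mentions."""
    receipt_number: Optional[str] = None
    receipt_date: Optional[str] = None
    amount: Optional[float] = None
    tva_total: Optional[float] = None
    tva_entries: List[dict] = field(default_factory=list)
    cui: Optional[str] = None
    overall_confidence: float = 0.0


MERGE_FIELDS = (
    "receipt_number", "receipt_date", "amount",
    "tva_total", "tva_entries", "cui",
)


def merge_extractions(
    primary: Optional[ExtractionResult],
    secondary: Optional[ExtractionResult],
) -> Optional[ExtractionResult]:
    """Fill gaps field by field (assumed policy, see lead-in)."""
    if primary is None:
        return secondary
    if secondary is None:
        return primary

    # Strictly higher confidence wins conflicts; ties keep the earlier result.
    if secondary.overall_confidence > primary.overall_confidence:
        preferred, fallback = secondary, primary
    else:
        preferred, fallback = primary, secondary

    merged = ExtractionResult(overall_confidence=preferred.overall_confidence)
    for name in MERGE_FIELDS:
        value = getattr(preferred, name) or getattr(fallback, name)
        if isinstance(value, list):
            value = list(value)  # copy so the inputs are not aliased
        setattr(merged, name, value)
    return merged
```

Keeping the earlier value on a confidence tie means a correct field from the light pass is not overwritten by a same-confidence misread from the heavy pass. Whether merged confidence should be the maximum, the minimum, or a weighted value is a design decision this sketch leaves open.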
+
+---
+
+## Estimated Benefits
+
+| Document type | OCR calls | Estimated time |
+|---------------|-----------|----------------|
+| Clear PDF (Kineterra) | 1 | ~2-3 sec |
+| Medium-quality PDF | 2 | ~5 sec |
+| Faded thermal receipt | 3 | ~8-10 sec |
+
+**Compared with now:** always 4 calls → at most 3, usually 1-2
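To put the comparison in perspective, the average number of OCR calls per document can be estimated from the share of each document type. This is an illustrative sketch only: `expected_calls` is a helper name introduced here, and the `mix` fractions are hypothetical placeholders, not measured data.

```python
from typing import Dict


def expected_calls(mix: Dict[int, float]) -> float:
    """Average OCR calls per document, given {calls_needed: share_of_documents}."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1, got {total}")
    return sum(calls * share for calls, share in mix.items())


# HYPOTHETICAL document mix: 60% clear PDFs, 30% medium, 10% faded receipts.
mix = {1: 0.6, 2: 0.3, 3: 0.1}
adaptive = expected_calls(mix)
current = 4  # the plan states the current pipeline always makes 4 calls

print(f"adaptive: {adaptive:.2f} calls/doc, current: {current} calls/doc")
```

Replacing the placeholder shares with counts from real uploads would show whether the "usually 1-2" estimate holds.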
+
+---
+
+## File to Modify
+
+**`data-entry-app/backend/app/services/ocr_service.py`**
+
+### Full replacement of `_process_sync()`:
+
+```python
+def _process_sync(
+    self,
+    image_path: Path,
+    mime_type: str
+) -> Tuple[bool, str, Optional[ExtractionResult]]:
+    """Synchronous processing with ADAPTIVE OCR pipeline."""
+
+    logger.info(f"[OCR Service] Starting processing: {image_path}, mime: {mime_type}")
+
+    # Load image
+    if mime_type == 'application/pdf':
+        try:
+            images = self.preprocessor.pdf_to_images(image_path)
+            if not images:
+                return False, "Failed to extract images from PDF", None
+            image = images[0]
+        except RuntimeError as e:
+            return False, str(e), None
+    else:
+        try:
+            image = self.preprocessor.load_image(image_path)
+        except ValueError as e:
+            return False, str(e), None
+
+    raw_texts = []
+    extraction = None
+
+    # ══════════════════════════════════════════════════════════════
+    # STEP 1: PaddleOCR + Light (fastest, best for clear PDFs)
+    # ══════════════════════════════════════════════════════════════
+    logger.info("[OCR] Step 1: PaddleOCR + Light preprocessing")
+    light_img = self.preprocessor.preprocess_light(image)
+
+    try:
+        paddle_light = self.ocr_engine._paddle_recognize(light_img)
+        if paddle_light and paddle_light.text:
+            extraction = self.extractor.extract(paddle_light.text)
+            extraction.ocr_engine = "paddle-light"
+            raw_texts.append(f"═══ PaddleOCR (light, conf: {paddle_light.confidence:.0%}) ═══\n{paddle_light.text}")
+
+            # Early exit if complete
+            if self._is_extraction_complete(extraction):
+                extraction.raw_text = "\n\n".join(raw_texts)
+                logger.info("[OCR] ✓ Early exit: complete extraction from paddle-light")
+                return True, "OCR complete (fast mode)", extraction
+    except Exception as e:
+        logger.warning(f"[OCR] PaddleOCR light failed: {e}")
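+        # Keep any result already extracted; fall back to an empty one only if none exists
+        if extraction is None:
+            extraction = ExtractionResult()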
+
+    # ══════════════════════════════════════════════════════════════
+    # STEP 2: PaddleOCR + Heavy (for faded thermal receipts)
+    # ══════════════════════════════════════════════════════════════
+    logger.info("[OCR] Step 2: PaddleOCR + Heavy preprocessing")
+    heavy_img = self.preprocessor.preprocess_heavy(image)
+
+    try:
+        paddle_heavy = self.ocr_engine._paddle_recognize(heavy_img)
+        if paddle_heavy and paddle_heavy.text:
+            extraction_heavy = self.extractor.extract(paddle_heavy.text)
+            extraction_heavy.ocr_engine = "paddle-heavy"
+            raw_texts.append(f"═══ PaddleOCR (heavy, conf: {paddle_heavy.confidence:.0%}) ═══\n{paddle_heavy.text}")
+
+            # Merge with previous
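+            # Step 1 may have produced no text, leaving `extraction` as None
+            if extraction is None:
+                extraction = extraction_heavy
+            else:
+                extraction = self._merge_extractions(extraction, extraction_heavy)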
+
+            if self._is_extraction_complete(extraction):
+                extraction.raw_text = "\n\n".join(raw_texts)
+                extraction.ocr_engine = "paddle-adaptive"
+                logger.info("[OCR] ✓ Early exit: complete extraction after paddle-heavy")
+                return True, "OCR complete (paddle dual)", extraction
+    except Exception as e:
+        logger.warning(f"[OCR] PaddleOCR heavy failed: {e}")
+
+    # ══════════════════════════════════════════════════════════════
+    # STEP 3: Tesseract fallback
+    # ══════════════════════════════════════════════════════════════
+    logger.info("[OCR] Step 3: Tesseract fallback")
+
+    try:
+        tesseract_result = self.ocr_engine._tesseract_recognize(light_img)
+        if tesseract_result and tesseract_result.text:
+            extraction_tess = self.extractor.extract(tesseract_result.text)
+            extraction_tess.ocr_engine = "tesseract"
+            raw_texts.append(f"═══ Tesseract (conf: {tesseract_result.confidence:.0%}) ═══\n{tesseract_result.text}")
+
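+            # Earlier steps may have produced no text, leaving `extraction` as None
+            if extraction is None:
+                extraction = extraction_tess
+            else:
+                extraction = self._merge_extractions(extraction, extraction_tess)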
+    except Exception as e:
+        logger.warning(f"[OCR] Tesseract failed: {e}")
+
+    # Final result
+    if extraction is None:
+        return False, "No text detected", None
+
+    extraction.raw_text = "\n\n".join(raw_texts)
+    extraction.ocr_engine = "adaptive-full"
+
+    # Build result message
+    fields_found = []
+    if extraction.amount: fields_found.append("amount")
+    if extraction.receipt_date: fields_found.append("date")
+    if extraction.receipt_number: fields_found.append("number")
+    if extraction.cui: fields_found.append("CUI")
+    if extraction.tva_total or extraction.tva_entries: fields_found.append("TVA")
+
+    message = f"OCR complete (full pipeline). Found: {', '.join(fields_found) or 'no fields'}"
+    logger.info(f"[OCR] Final result: {message}")
+
+    return True, message, extraction
+```
+
+### Add method `_is_extraction_complete()`:
+
+```python
+def _is_extraction_complete(self, ext: ExtractionResult, min_confidence: float = 0.85) -> bool:
+    """
+    Check if extraction has ALL required fields to skip further processing.
+
+    Required for early exit (ALL must be true):
+    - Overall confidence >= 85%
+    - ALL 5 critical fields present: number, date, amount, TVA, CUI
+    """
+    # Must have high confidence
+    if ext.overall_confidence < min_confidence:
+        logger.info(f"[OCR] Confidence {ext.overall_confidence:.0%} < {min_confidence:.0%} - continuing")
+        return False
+
+    # Check all required fields
+    has_number = bool(ext.receipt_number)
+    has_date = bool(ext.receipt_date)
+    has_amount = bool(ext.amount)
+    has_tva = bool(ext.tva_total) or bool(ext.tva_entries)
+    has_cui = bool(ext.cui)
+
+    missing = []
+    if not has_number: missing.append("number")
+    if not has_date: missing.append("date")
+    if not has_amount: missing.append("amount")
+    if not has_tva: missing.append("TVA")
+    if not has_cui: missing.append("CUI")
+
+    if missing:
+        logger.info(f"[OCR] Missing: {', '.join(missing)} - continuing")
+        return False
+
+    logger.info(f"[OCR] ✓ All 5 fields found with {ext.overall_confidence:.0%} confidence")
+    return True
+```
+
+---
+
+## Code to Delete
+
+After implementation, you can delete:
+- `_merge_all_extractions()` - replaced by the sequential flow
+- `_format_dual_raw_text()` - unused
+- The `for i, processed in enumerate(variants):` loop - fully replaced
+
+---
+
+## Context: Kineterra OCR Results
+
+In earlier tests, **PaddleOCR + Light** gave the best results:
+
+| Variant | Conf | CUI | Address |
+|---------|------|-----|---------|
+| **PaddleOCR Light** | **83%** | **31180432** ✓ | MUN.CONSTANTA ✓ |
+| PaddleOCR Heavy | 83% | 31189432 ✗ | CONSTANTN ✗ |
+| Tesseract Light | 50% | 31100400 ✗ | corrupted |
+| Tesseract Heavy | 42% | - | corrupted |
+
+---
+
+## Testing
+
+After implementation, test with all the PDFs:
+
+1. **`abonament kineterra.pdf`** - should early-exit at Step 1 (note: the Kineterra table above reports 83% for PaddleOCR Light, which is below the 85% threshold if `overall_confidence` is the same metric)
+2. **`benzina 27 octombrie.pdf`** - verify complete extraction
+3. **`igiena 11 octombrie.pdf`** - verify complete extraction
+4. **`benzina 14 august.pdf`** - verify complete extraction
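Besides the PDF runs above, the early-exit control flow can be checked without real OCR engines by counting which stages get invoked against stubs. The sketch below is a simplified model of the planned pipeline, not the service code: `run_adaptive`, `is_complete`, and the stub stages are names introduced here, fields are plain dict entries, and only `tva_total` stands in for the VAT check (the plan also accepts `tva_entries`).

```python
from typing import Callable, Dict, List, Optional, Tuple

REQUIRED = ("receipt_number", "receipt_date", "amount", "tva_total", "cui")
Fields = Dict[str, Optional[str]]
Stage = Tuple[str, Callable[[], Tuple[Fields, float]]]


def is_complete(fields: Fields, confidence: float, min_confidence: float = 0.85) -> bool:
    """Early-exit rule from the plan: confidence threshold AND all required fields."""
    return confidence >= min_confidence and all(fields.get(k) for k in REQUIRED)


def run_adaptive(stages: List[Stage]) -> Tuple[Fields, List[str]]:
    """Run stages in order, merging results, and stop once extraction is complete."""
    merged: Fields = {}
    best_conf = 0.0
    calls: List[str] = []
    for name, stage in stages:
        calls.append(name)
        fields, conf = stage()
        for key, value in fields.items():
            if value and not merged.get(key):
                merged[key] = value  # earlier stages keep priority
        best_conf = max(best_conf, conf)  # assumed policy for merged confidence
        if is_complete(merged, best_conf):
            break
    return merged, calls


# Stub stages with made-up values, standing in for the real OCR engines.
FULL: Fields = {k: "x" for k in REQUIRED}
clear_pdf: List[Stage] = [
    ("paddle-light", lambda: (FULL, 0.90)),
    ("paddle-heavy", lambda: (FULL, 0.90)),
    ("tesseract", lambda: (FULL, 0.50)),
]
faded_receipt: List[Stage] = [
    ("paddle-light", lambda: ({"amount": "x"}, 0.60)),
    ("paddle-heavy", lambda: ({"cui": "x"}, 0.70)),
    ("tesseract", lambda: ({}, 0.40)),
]

_, clear_calls = run_adaptive(clear_pdf)
_, faded_calls = run_adaptive(faded_receipt)
print(clear_calls)
print(faded_calls)
```

The same idea carries over to the real service by replacing the OCR engine with a test double and asserting on how many recognize calls were made.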
+
+---
+
+*Generated: 2024-12-12*
+*For continuation in the next Claude session*