# OCR Implementation Plan - Data Entry App

> **Context Handover Document**
> Created: 2025-12-11
> Branch: `feature/data-entry-receipts`
> Status: Ready for implementation

## Executive Summary

Implement fully local OCR (no external costs) to automatically extract data from Romanian fiscal receipts (bonuri fiscale) and payment receipts (chitanțe). The solution uses PaddleOCR plus regex-based extraction, and fills in the form automatically.

**User requirements:**
- Open-source and local, with no external costs
- Full-auto: the form is completed automatically
- Input: image files only (JPG/PNG/PDF)
- On-premise processing

---

## Recommended Tech Stack

| Component | Solution | Rationale |
|-----------|----------|-----------|
| **OCR Engine** | PaddleOCR (primary) | 85-92% accuracy, simple pip install, CPU-friendly |
| **Fallback OCR** | Tesseract + `ron` | Excellent support for Romanian diacritics |
| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100 ms), deterministic |
| **Preprocessing** | OpenCV | Deskew, binarization, denoising; essential for thermal receipts |
| **PDF → Image** | pdf2image + Poppler | Standard, reliable |

---

## Files to Create

### Backend (New)

```
data-entry-app/backend/app/
├── services/
│   ├── ocr_service.py          # OCR orchestration (async)
│   ├── ocr_engine.py           # PaddleOCR + Tesseract wrapper
│   ├── ocr_extractor.py        # Regex patterns for Romanian receipts
│   └── image_preprocessor.py   # OpenCV pipeline
├── schemas/
│   └── ocr.py                  # ExtractionData, OCRResponse
└── routers/
    └── ocr.py                  # POST /api/ocr/extract
```

### Frontend (New)

```
data-entry-app/frontend/src/components/ocr/
├── OCRUploadZone.vue           # Drag-and-drop + OCR trigger
├── OCRPreview.vue              # Preview of extracted data
└── OCRConfidenceIndicator.vue  # Visual confidence indicator
```

### Changes to Existing Files

- `data-entry-app/backend/requirements.txt` - add OCR dependencies
- `data-entry-app/backend/app/main.py` - include the OCR router
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - OCR integration

---

## Fields to Extract (from the Receipt model)

Target fields for OCR extraction (see `data-entry-app/backend/app/db/models/receipt.py`):

| Field | Type | Estimated accuracy |
|-------|------|--------------------|
| `receipt_type` | Enum: BON_FISCAL, CHITANTA | 95%+ |
| `receipt_number` | String (max 50) | 80-85% |
| `receipt_date` | Date | 85-90% |
| `amount` | Decimal(2) | 90-95% |
| `partner_name` | String (max 200) | 70-80% |
| `cui` | String (fiscal code) | 85-90% |

---

## API Design

### `POST /api/ocr/extract`

**Input**: `multipart/form-data` with a file (JPG/PNG/PDF, max 10 MB)

**Output**:

```json
{
  "success": true,
  "message": "OCR processing successful",
  "data": {
    "receipt_type": "bon_fiscal",
    "receipt_number": "12345",
    "receipt_series": null,
    "receipt_date": "2024-01-15",
    "amount": 125.50,
    "partner_name": "MEGA IMAGE SRL",
    "cui": "12345678",
    "description": null,
    "confidence_amount": 0.95,
    "confidence_date": 0.90,
    "confidence_vendor": 0.75,
    "overall_confidence": 0.87,
    "raw_text": "BON FISCAL\nMEGA IMAGE SRL\n..."
  }
}
```

### `POST /api/ocr/extract-attachment/{attachment_id}`

Re-processes an existing attachment.

---

## Detailed Implementation

### 1. Image Preprocessor (`image_preprocessor.py`)

```python
"""Image preprocessing for optimal OCR results."""
from pathlib import Path
from typing import List

import cv2
import numpy as np

try:
    import pdf2image
    PDF_AVAILABLE = True
except ImportError:
    PDF_AVAILABLE = False


class ImagePreprocessor:
    """Preprocess receipt images for OCR."""

    def load_image(self, path: Path) -> np.ndarray:
        """Load image from file (OpenCV returns BGR channel order)."""
        image = cv2.imread(str(path))
        if image is None:
            raise ValueError(f"Could not load image: {path}")
        return image

    def pdf_to_images(self, path: Path, dpi: int = 300) -> List[np.ndarray]:
        """Convert PDF pages to BGR images."""
        if not PDF_AVAILABLE:
            raise RuntimeError("pdf2image not available")
        pages = pdf2image.convert_from_path(str(path), dpi=dpi)
        # pdf2image returns PIL images in RGB order; convert to OpenCV's BGR
        # so preprocess() can treat every input the same way.
        return [
            cv2.cvtColor(np.array(page.convert("RGB")), cv2.COLOR_RGB2BGR)
            for page in pages
        ]

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Apply preprocessing pipeline for thermal receipt images.

        Pipeline:
        1. Convert to grayscale
        2. Resize if too small (min 1000px width)
        3. Deskew (straighten rotated text)
        4. Denoise (Non-local means)
        5. Adaptive thresholding (binarization)
        6. Morphological close (connect broken chars)
        """
        # 1. Grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        # 2. Resize if too small
        height, width = gray.shape
        if width < 1000:
            scale = 1000 / width
            gray = cv2.resize(gray, None, fx=scale, fy=scale,
                              interpolation=cv2.INTER_CUBIC)

        # 3. Deskew
        gray = self._deskew(gray)

        # 4. Denoise
        denoised = cv2.fastNlMeansDenoising(gray, h=10, templateWindowSize=7,
                                            searchWindowSize=21)

        # 5. Adaptive thresholding
        binary = cv2.adaptiveThreshold(
            denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY, blockSize=15, C=8
        )

        # 6.
        #    Morphological close: reconnect broken character strokes
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
        result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
        return result

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct image rotation/skew using Hough lines."""
        edges = cv2.Canny(image, 50, 150, apertureSize=3)
        lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                                minLineLength=100, maxLineGap=10)
        if lines is None:
            return image

        # Keep only near-horizontal segments (text baselines)
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
            if abs(angle) < 45:
                angles.append(angle)
        if not angles:
            return image

        median_angle = np.median(angles)
        if abs(median_angle) < 0.5:
            return image  # Skew too small to be worth correcting

        h, w = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
        return cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
```

### 2. OCR Engine (`ocr_engine.py`)

```python
"""OCR engine wrapper for PaddleOCR and Tesseract."""
from dataclasses import dataclass
from typing import List

import numpy as np

try:
    from paddleocr import PaddleOCR
    PADDLE_AVAILABLE = True
except ImportError:
    PADDLE_AVAILABLE = False

try:
    import pytesseract
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False


@dataclass
class OCRResult:
    """Raw OCR result."""
    text: str
    confidence: float
    boxes: List[dict]


class OCREngine:
    """Unified OCR engine with fallback support."""

    def __init__(self):
        self._paddle = None
        self._init_engines()

    def _init_engines(self):
        if PADDLE_AVAILABLE:
            self._paddle = PaddleOCR(
                use_angle_cls=True,
                lang='en',  # Better for mixed text
                use_gpu=False,
                show_log=False,
                det_db_thresh=0.3,
                det_db_box_thresh=0.5,
            )

    def recognize(self, image: np.ndarray) -> OCRResult:
        """Perform OCR on preprocessed image."""
        if PADDLE_AVAILABLE and self._paddle:
            return self._paddle_recognize(image)
        elif TESSERACT_AVAILABLE:
            return self._tesseract_recognize(image)
        else:
            raise RuntimeError("No OCR engine available")

    def _paddle_recognize(self, image: np.ndarray) -> OCRResult:
        result = self._paddle.ocr(image, cls=True)
        if not result or not result[0]:
            return OCRResult(text="", confidence=0.0, boxes=[])

        lines = []
        total_conf = 0.0
        boxes = []
        for line in result[0]:
            box, (text, conf) = line
            lines.append(text)
            total_conf += conf
            boxes.append({'text': text, 'confidence': conf, 'box': box})

        avg_conf = total_conf / len(lines) if lines else 0.0
        return OCRResult(text='\n'.join(lines), confidence=avg_conf, boxes=boxes)

    def _tesseract_recognize(self, image: np.ndarray) -> OCRResult:
        config = '--psm 6 -l ron+eng'
        text = pytesseract.image_to_string(image, config=config)
        data = pytesseract.image_to_data(image, config=config,
                                         output_type=pytesseract.Output.DICT)
        # Confidence values may arrive as ints, floats, or numeric strings
        # depending on the Tesseract version; -1 marks non-text boxes.
        confidences = [float(c) for c in data['conf'] if float(c) > 0]
        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
        return OCRResult(text=text, confidence=avg_conf, boxes=[])
```

### 3. Receipt Extractor (`ocr_extractor.py`)

```python
"""Extract structured fields from OCR text."""
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple


@dataclass
class ExtractionResult:
    """Structured extraction result."""
    receipt_type: str = 'bon_fiscal'
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None
    confidence_amount: float = 0.0
    confidence_date: float = 0.0
    confidence_vendor: float = 0.0
    raw_text: str = ""

    @property
    def overall_confidence(self) -> float:
        weights = {'amount': 0.4, 'date': 0.3, 'vendor': 0.3}
        return round(
            self.confidence_amount * weights['amount']
            + self.confidence_date * weights['date']
            + self.confidence_vendor * weights['vendor'],
            2
        )


class ReceiptExtractor:
    """Extract receipt fields using pattern matching."""
    # Each entry is (regex, confidence); patterns are tried in order, most specific first.
    # Amount groups allow spaces but not newlines, so a total cannot absorb digits
    # from the next line; \b keeps TOTAL from matching inside SUBTOTAL.
    TOTAL_PATTERNS = [
        (r'\bTOTAL\s*:?\s*(\d[\d .,]*)\s*(?:RON|LEI)?', 0.95),
        (r'\bTOTAL\s+(?:RON|LEI)\s*(\d[\d .,]*)', 0.95),
        (r'\bDE\s+PLATA\s*:?\s*(\d[\d .,]*)', 0.90),
        (r'\bSUMA\s*:?\s*(\d[\d .,]*)', 0.85),
    ]

    DATE_PATTERNS = [
        (r'DATA\s*:?\s*(\d{2}[./]\d{2}[./]\d{4})', 0.95),
        (r'(\d{2}[./]\d{2}[./]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[./]\d{2}[./]\d{4})', 0.80),
    ]

    NUMBER_PATTERNS = [
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*:?\s*(\d{4,})', 0.70),
    ]

    CUI_PATTERNS = [
        (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
        (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
    ]

    # Lines containing these whole words are headers, not vendor names
    SKIP_KEYWORDS_RE = re.compile(r'\b(?:BON|FISCAL|TOTAL|DATA|NR|ORA)\b')

    def extract(self, text: str) -> ExtractionResult:
        result = ExtractionResult()
        text_upper = text.upper()

        result.amount, result.confidence_amount = self._extract_amount(text_upper)
        result.receipt_date, result.confidence_date = self._extract_date(text_upper)
        result.receipt_number, _ = self._extract_number(text_upper)
        result.partner_name, result.confidence_vendor = self._extract_vendor(text)
        result.cui, _ = self._extract_cui(text_upper)
        return result

    def _extract_amount(self, text: str) -> Tuple[Optional[Decimal], float]:
        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                try:
                    # Keep digits and separators, drop trailing punctuation
                    amount_str = re.sub(r'[^\d.,]', '', match.group(1)).strip('.,')
                    amount_str = amount_str.replace(',', '.')
                    parts = amount_str.split('.')
                    if len(parts) > 2:
                        # "1.234.56": only the last separator is the decimal point
                        amount_str = ''.join(parts[:-1]) + '.' + parts[-1]
                    amount = Decimal(amount_str)
                    if amount > 0:
                        return amount, confidence
                except (InvalidOperation, ValueError):
                    continue
        return None, 0.0

    def _extract_date(self, text: str) -> Tuple[Optional[date], float]:
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text)
            if match:
                try:
                    date_str = match.group(1).replace('/', '.')
                    parsed = datetime.strptime(date_str, '%d.%m.%Y').date()
                    today = date.today()
                    # Reject future dates and implausibly old receipts
                    if parsed <= today and parsed.year >= 2020:
                        return parsed, confidence
                except ValueError:
                    continue
        return None, 0.0

    def _extract_number(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.NUMBER_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                return match.group(1), confidence
        return None, 0.0

    def _extract_vendor(self, text: str) -> Tuple[Optional[str], float]:
        # The vendor name is usually one of the first lines on the receipt
        lines = text.split('\n')
        for i, line in enumerate(lines[:5]):
            line = line.strip()
            if not line or re.match(r'^[\d.,\s]+$', line):
                continue
            # Whole-word match, so names such as "CORA" are not skipped by "ORA"
            if self.SKIP_KEYWORDS_RE.search(line.upper()):
                continue
            vendor = re.sub(r'[^\w\s.,&-]', '', line).strip()
            if len(vendor) >= 3:
                # Confidence drops the further down the vendor line appears
                return vendor, 0.7 - (i * 0.1)
        return None, 0.0

    def _extract_cui(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.CUI_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                cui = match.group(1)
                if 6 <= len(cui) <= 10:
                    return cui, confidence
        return None, 0.0
```

### 4. OCR Service (`ocr_service.py`)

```python
"""Main OCR service coordinating preprocessing, recognition, and extraction."""
import asyncio
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Optional, Tuple

from app.services.image_preprocessor import ImagePreprocessor
from app.services.ocr_engine import OCREngine
from app.services.ocr_extractor import ExtractionResult, ReceiptExtractor


class OCRService:
    """Service for OCR processing of receipt images."""

    _executor = ThreadPoolExecutor(max_workers=2)

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.ocr_engine = OCREngine()
        self.extractor = ReceiptExtractor()

    async def process_image(
        self, image_path: Path, mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Process receipt image and extract structured data."""
        try:
            # OCR is CPU-bound, so run it in the thread pool instead of
            # blocking the event loop.
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(
                self._executor, self._process_sync, image_path, mime_type
            )
            return result
        except Exception as e:
            return False, f"OCR processing failed: {str(e)}", None

    def _process_sync(
        self, image_path: Path, mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Synchronous processing (runs in thread pool)."""
        # Handle PDF
        if mime_type == 'application/pdf':
            images = self.preprocessor.pdf_to_images(image_path)
            if not images:
                return False, "Failed to extract images from PDF", None
            image = images[0]  # First page only
        else:
            image = self.preprocessor.load_image(image_path)

        # Preprocess
        processed = self.preprocessor.preprocess(image)

        # OCR
        ocr_result = self.ocr_engine.recognize(processed)
        if not ocr_result.text:
            return False, "No text detected in image", None

        # Extract fields
        extraction = self.extractor.extract(ocr_result.text)
        extraction.raw_text = ocr_result.text

        # Detect receipt type
        text_upper = ocr_result.text.upper()
        if 'CHITANTA' in text_upper or 'CHITANȚĂ' in text_upper:
            extraction.receipt_type = 'chitanta'
        else:
            extraction.receipt_type = 'bon_fiscal'

        return True, "OCR processing successful", extraction
```

### 5. Schemas (`schemas/ocr.py`)

```python
"""Pydantic schemas for OCR API."""
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field


class ExtractionData(BaseModel):
    """Extracted receipt data."""
    receipt_type: str = Field(default='bon_fiscal')
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None
    confidence_amount: float = Field(default=0.0, ge=0, le=1)
    confidence_date: float = Field(default=0.0, ge=0, le=1)
    confidence_vendor: float = Field(default=0.0, ge=0, le=1)
    overall_confidence: float = Field(default=0.0, ge=0, le=1)
    raw_text: str = Field(default="")


class OCRResponse(BaseModel):
    """OCR API response."""
    success: bool
    message: str
    data: Optional[ExtractionData] = None
```

### 6. Router (`routers/ocr.py`)

```python
"""OCR API endpoints."""
import os
import tempfile
from pathlib import Path

from fastapi import APIRouter, Depends, File, HTTPException, UploadFile
from sqlalchemy.ext.asyncio import AsyncSession

from app.db.crud.attachment import AttachmentCRUD
from app.db.database import get_session
from app.schemas.ocr import OCRResponse
from app.services.ocr_service import OCRService

router = APIRouter()
ocr_service = OCRService()

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB limit from the API design


@router.post("/extract", response_model=OCRResponse)
async def extract_from_image(file: UploadFile = File(...)):
    """Extract receipt data from uploaded image."""
    allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
    if file.content_type not in allowed_types:
        raise HTTPException(400, f"File type not supported: {file.content_type}")

    # Read first so an oversized upload is rejected before touching the disk
    content = await file.read()
    if len(content) > MAX_UPLOAD_BYTES:
        raise HTTPException(413, "File too large (max 10 MB)")

    suffix = Path(file.filename).suffix if file.filename else '.jpg'
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)

    try:
        # Run the OCR pipeline; the temp file is removed in `finally` either way.
        success, message, result = await ocr_service.process_image(
            tmp_path, file.content_type
        )
        if not success:
            raise HTTPException(422, message)
        # ExtractionResult is a dataclass, so pass its fields as a dict that
        # Pydantic can validate; overall_confidence is a property and is
        # added explicitly.
        data = {**vars(result), "overall_confidence": result.overall_confidence}
        return OCRResponse(success=True, message=message, data=data)
    finally:
        os.unlink(tmp_path)


@router.post("/extract-attachment/{attachment_id}", response_model=OCRResponse)
async def extract_from_attachment(
    attachment_id: int,
    session: AsyncSession = Depends(get_session),
):
    """Extract receipt data from existing attachment."""
    attachment = await AttachmentCRUD.get_by_id(session, attachment_id)
    if not attachment:
        raise HTTPException(404, "Attachment not found")

    file_path = AttachmentCRUD.get_file_path(attachment)
    if not file_path.exists():
        raise HTTPException(404, "File not found on disk")

    success, message, result = await ocr_service.process_image(
        file_path, attachment.mime_type
    )
    if not success:
        raise HTTPException(422, message)
    # Same dataclass-to-dict conversion as in extract_from_image
    data = {**vars(result), "overall_confidence": result.overall_confidence}
    return OCRResponse(success=True, message=message, data=data)
```

---

## Dependencies

### Python (`requirements.txt` - add)

```
# OCR Dependencies
# The engine wrapper in this plan uses the PaddleOCR 2.x API
# (use_angle_cls, use_gpu, show_log, ocr(..., cls=True)), hence the upper bound.
paddleocr>=2.7.0,<3.0
paddlepaddle>=2.5.0
opencv-python>=4.8.0
pytesseract>=0.3.10
pdf2image>=1.16.0
```

### System (Linux/Docker)

```bash
# Note: on newer Debian/Ubuntu releases libgl1-mesa-glx may no longer exist;
# libgl1 is its replacement.
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0
```

---

## User Flow

```
1. User opens "Bon Fiscal Nou" (New Fiscal Receipt)
2. User drags or selects the receipt photo in OCRUploadZone
3. [Spinner 2-3 sec] "Se procesează imaginea..." (Processing image...)
4. OCRPreview appears with the extracted data + confidence indicators
5. User clicks "Aplică datele" (Apply data) or corrects values manually
6. The form is filled in automatically
7. User selects the expense type and the cash register
8. User saves a draft or submits for approval
```

---

## Implementation Steps

### Step 1: Dependencies and setup
- [ ] Add dependencies to `requirements.txt`
- [ ] Install system packages (tesseract, poppler)
- [ ] Test the PaddleOCR import

### Step 2: Backend services
- [ ] Create `image_preprocessor.py`
- [ ] Create `ocr_engine.py`
- [ ] Create `ocr_extractor.py`
- [ ] Create `ocr_service.py`
- [ ] Create `schemas/ocr.py`

### Step 3: API endpoint
- [ ] Create `routers/ocr.py`
- [ ] Include the router in `main.py`
- [ ] Test the endpoint

### Step 4: Frontend components
- [ ] Create `OCRUploadZone.vue`
- [ ] Create `OCRPreview.vue`
- [ ] Create `OCRConfidenceIndicator.vue`

### Step 5: Integration
- [ ] Modify `ReceiptCreateView.vue`
- [ ] Add auto-fill from the OCR result
- [ ] Add visual feedback

### Step 6: Testing
- [ ] Test on sample Romanian receipts
- [ ] Tune the regex patterns
- [ ] Optimize preprocessing

---

## References to Existing Files

- `data-entry-app/backend/app/services/receipt_service.py` - Service pattern
- `data-entry-app/backend/app/db/crud/attachment.py` - File handling
- `data-entry-app/backend/app/schemas/receipt.py` - Schema patterns
- `data-entry-app/backend/app/db/models/receipt.py` - Receipt model
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - View to modify
- `data-entry-app/CLAUDE.md` - Data-entry-specific instructions
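
## Appendix: Regex Regression Sketch

For the testing step in the implementation steps above (sample receipts and regex tuning), a small standalone check can help catch regressions whenever a pattern is changed. The sketch below is illustrative only: `TOTAL_RE`, `SAMPLE_TEXT`, `parse_amount`, and `extract_total` are hypothetical names, the sample text is made up, and the pattern is a variant modelled on the `TOTAL_PATTERNS` idea in this plan rather than the exact code of `ocr_extractor.py`.

```python
"""Quick regression check for a TOTAL amount pattern (illustrative only)."""
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical pattern: \b prevents a match inside SUBTOTAL, and the capture
# group allows spaces but no newlines.
TOTAL_RE = re.compile(r'\bTOTAL\s*:?\s*(\d[\d .,]*)')

# Made-up OCR output for a receipt; not real data.
SAMPLE_TEXT = (
    "MEGA IMAGE SRL\n"
    "CUI: RO12345678\n"
    "SUBTOTAL 100,00\n"
    "TOTAL: 1.125,50 LEI\n"
    "DATA: 15.01.2024"
)


def parse_amount(raw: str) -> Optional[Decimal]:
    """Normalize a Romanian-formatted amount such as '1.125,50' to a Decimal."""
    cleaned = re.sub(r'[^\d.,]', '', raw).strip('.,').replace(',', '.')
    parts = cleaned.split('.')
    if len(parts) > 2:
        # Everything before the last separator is thousands grouping.
        cleaned = ''.join(parts[:-1]) + '.' + parts[-1]
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_total(text: str) -> Optional[Decimal]:
    """Return the first TOTAL amount found in the text, or None."""
    match = TOTAL_RE.search(text.upper())
    return parse_amount(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_total(SAMPLE_TEXT))
```

In practice, each real sample receipt that fails extraction could be added as another text fixture with its expected amount, so pattern changes are checked against all known cases at once.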