feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name, date, total amount, and VAT from uploaded receipt images/PDFs, reducing manual data entry and improving accuracy. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions
--- a/docs/OCR_IMPLEMENTATION_PLAN.md
+++ b/docs/OCR_IMPLEMENTATION_PLAN.md
@@ -0,0 +1,717 @@
+# OCR Implementation Plan - Data Entry App
+
+> **Context Handover Document**
+> Created: 2025-12-11
+> Branch: `feature/data-entry-receipts`
+> Status: Ready for implementation
+
+## Executive Summary
+
+Implementare OCR 100% local (fără costuri externe) pentru extragerea automată a datelor din bonuri fiscale/chitanțe românești. Soluția folosește PaddleOCR + regex extraction cu full-auto completion a formularului.
+
+**Cerințe utilizator:**
+- Open-source local, fără costuri externe
+- Full-auto: completează formularul automat
+- Input: doar imagini (JPG/PNG/PDF)
+- On-premise processing
+
+---
+
+## Stack Tehnic Recomandat
+
+| Component | Soluție | Justificare |
+|-----------|---------|-------------|
+| **OCR Engine** | PaddleOCR (primar) | 85-92% acuratețe, pip install simplu, CPU-friendly |
+| **Fallback OCR** | Tesseract + ron | Suport excelent diacritice românești |
+| **Extracție** | Regex/rules-based | Zero dependențe extra, rapid (<100ms), deterministic |
+| **Preprocessing** | OpenCV | Deskew, binarizare, denoise - esențial pentru bonuri termice |
+| **PDF → Image** | pdf2image + Poppler | Standard, fiabil |
+
+---
+
+## Fișiere de Creat
+
+### Backend (Noi)
+```
+data-entry-app/backend/app/
+├── services/
+│   ├── ocr_service.py          # Orchestrare OCR (async)
+│   ├── ocr_engine.py           # Wrapper PaddleOCR + Tesseract
+│   ├── ocr_extractor.py        # Regex patterns pentru bonuri RO
+│   └── image_preprocessor.py   # OpenCV pipeline
+├── schemas/
+│   └── ocr.py                  # ExtractionData, OCRResponse
+└── routers/
+    └── ocr.py                  # POST /api/ocr/extract
+```
+
+### Frontend (Noi)
+```
+data-entry-app/frontend/src/components/ocr/
+├── OCRUploadZone.vue           # Drag-drop + trigger OCR
+├── OCRPreview.vue              # Preview date extrase
+└── OCRConfidenceIndicator.vue  # Indicator vizual încredere
+```
+
+### Modificări la fișiere existente
+- `data-entry-app/backend/requirements.txt` - adaugă dependențe OCR
+- `data-entry-app/backend/app/main.py` - include OCR router
+- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - integrare OCR
+
+---
+
+## Câmpuri de Extras (din Receipt model)
+
+Câmpurile țintă pentru OCR extraction (vezi `data-entry-app/backend/app/db/models/receipt.py`):
+
+| Câmp | Tip | Acuratețe estimată |
+|------|-----|-------------------|
+| `receipt_type` | Enum: BON_FISCAL, CHITANTA | 95%+ |
+| `receipt_number` | String (max 50) | 80-85% |
+| `receipt_date` | Date | 85-90% |
+| `amount` | Decimal(2) | 90-95% |
+| `partner_name` | String (max 200) | 70-80% |
+| `cui` | String (fiscal code) | 85-90% |
+
+---
+
+## API Design
+
+### `POST /api/ocr/extract`
+
+**Input**: `multipart/form-data` cu fișier (JPG/PNG/PDF, max 10MB)
+
+**Output**:
+```json
+{
+  "success": true,
+  "message": "OCR processing successful",
+  "data": {
+    "receipt_type": "bon_fiscal",
+    "receipt_number": "12345",
+    "receipt_series": null,
+    "receipt_date": "2024-01-15",
+    "amount": 125.50,
+    "partner_name": "MEGA IMAGE SRL",
+    "cui": "12345678",
+    "description": null,
+    "confidence_amount": 0.95,
+    "confidence_date": 0.90,
+    "confidence_vendor": 0.75,
+    "overall_confidence": 0.87,
+    "raw_text": "BON FISCAL\nMEGA IMAGE SRL\n..."
+  }
+}
+```
+
+### `POST /api/ocr/extract-attachment/{attachment_id}`
+Re-procesează un attachment existent.
+
+---
+
+## Implementare Detaliată
+
+### 1. Image Preprocessor (`image_preprocessor.py`)
+
+```python
+"""Image preprocessing for optimal OCR results."""
+from pathlib import Path
+from typing import List
+import numpy as np
+import cv2
+
+try:
+    import pdf2image
+    PDF_AVAILABLE = True
+except ImportError:
+    PDF_AVAILABLE = False
+
+
+class ImagePreprocessor:
+    """Preprocess receipt images for OCR."""
+
+    def load_image(self, path: Path) -> np.ndarray:
+        """Load image from file."""
+        image = cv2.imread(str(path))
+        if image is None:
+            raise ValueError(f"Could not load image: {path}")
+        return image
+
+    def pdf_to_images(self, path: Path, dpi: int = 300) -> List[np.ndarray]:
+        """Convert PDF to images."""
+        if not PDF_AVAILABLE:
+            raise RuntimeError("pdf2image not available")
+        images = pdf2image.convert_from_path(str(path), dpi=dpi)
+        return [np.array(img) for img in images]
+
+    def preprocess(self, image: np.ndarray) -> np.ndarray:
+        """
+        Apply preprocessing pipeline for thermal receipt images.
+
+        Pipeline:
+        1. Convert to grayscale
+        2. Resize if too small (min 1000px width)
+        3. Deskew (straighten rotated text)
+        4. Denoise (Non-local means)
+        5. Adaptive thresholding (binarization)
+        6. Morphological close (connect broken chars)
+        """
+        # 1. Grayscale
+        if len(image.shape) == 3:
+            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
+        else:
+            gray = image.copy()
+
+        # 2. Resize if too small
+        height, width = gray.shape
+        if width < 1000:
+            scale = 1000 / width
+            gray = cv2.resize(gray, None, fx=scale, fy=scale,
+                            interpolation=cv2.INTER_CUBIC)
+
+        # 3. Deskew
+        gray = self._deskew(gray)
+
+        # 4. Denoise
+        denoised = cv2.fastNlMeansDenoising(gray, h=10,
+                                            templateWindowSize=7,
+                                            searchWindowSize=21)
+
+        # 5. Adaptive thresholding
+        binary = cv2.adaptiveThreshold(
+            denoised, 255,
+            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+            cv2.THRESH_BINARY,
+            blockSize=15, C=8
+        )
+
+        # 6. Morphological close
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
+        result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
+
+        return result
+
+    def _deskew(self, image: np.ndarray) -> np.ndarray:
+        """Correct image rotation/skew using Hough lines."""
+        edges = cv2.Canny(image, 50, 150, apertureSize=3)
+        lines = cv2.HoughLinesP(edges, 1, np.pi/180,
+                               threshold=100, minLineLength=100, maxLineGap=10)
+
+        if lines is None:
+            return image
+
+        angles = []
+        for line in lines:
+            x1, y1, x2, y2 = line[0]
+            angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
+            if abs(angle) < 45:
+                angles.append(angle)
+
+        if not angles:
+            return image
+
+        median_angle = np.median(angles)
+        if abs(median_angle) < 0.5:
+            return image
+
+        h, w = image.shape[:2]
+        center = (w // 2, h // 2)
+        M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
+        return cv2.warpAffine(image, M, (w, h),
+                             flags=cv2.INTER_CUBIC,
+                             borderMode=cv2.BORDER_REPLICATE)
+```
+
+### 2. OCR Engine (`ocr_engine.py`)
+
+```python
+"""OCR engine wrapper for PaddleOCR and Tesseract."""
+from dataclasses import dataclass
+from typing import List
+import numpy as np
+
+try:
+    from paddleocr import PaddleOCR
+    PADDLE_AVAILABLE = True
+except ImportError:
+    PADDLE_AVAILABLE = False
+
+try:
+    import pytesseract
+    TESSERACT_AVAILABLE = True
+except ImportError:
+    TESSERACT_AVAILABLE = False
+
+
+@dataclass
+class OCRResult:
+    """Raw OCR result."""
+    text: str
+    confidence: float
+    boxes: List[dict]
+
+
+class OCREngine:
+    """Unified OCR engine with fallback support."""
+
+    def __init__(self):
+        self._paddle = None
+        self._init_engines()
+
+    def _init_engines(self):
+        if PADDLE_AVAILABLE:
+            self._paddle = PaddleOCR(
+                use_angle_cls=True,
+                lang='en',  # Better for mixed text
+                use_gpu=False,
+                show_log=False,
+                det_db_thresh=0.3,
+                det_db_box_thresh=0.5,
+            )
+
+    def recognize(self, image: np.ndarray) -> OCRResult:
+        """Perform OCR on preprocessed image."""
+        if PADDLE_AVAILABLE and self._paddle:
+            return self._paddle_recognize(image)
+        elif TESSERACT_AVAILABLE:
+            return self._tesseract_recognize(image)
+        else:
+            raise RuntimeError("No OCR engine available")
+
+    def _paddle_recognize(self, image: np.ndarray) -> OCRResult:
+        result = self._paddle.ocr(image, cls=True)
+
+        if not result or not result[0]:
+            return OCRResult(text="", confidence=0.0, boxes=[])
+
+        lines = []
+        total_conf = 0.0
+        boxes = []
+
+        for line in result[0]:
+            box, (text, conf) = line
+            lines.append(text)
+            total_conf += conf
+            boxes.append({'text': text, 'confidence': conf, 'box': box})
+
+        avg_conf = total_conf / len(result[0]) if result[0] else 0.0
+        return OCRResult(text='\n'.join(lines), confidence=avg_conf, boxes=boxes)
+
+    def _tesseract_recognize(self, image: np.ndarray) -> OCRResult:
+        config = '--psm 6 -l ron+eng'
+        text = pytesseract.image_to_string(image, config=config)
+        data = pytesseract.image_to_data(image, config=config,
+                                         output_type=pytesseract.Output.DICT)
+        confidences = [int(c) for c in data['conf'] if int(c) > 0]
+        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
+        return OCRResult(text=text, confidence=avg_conf, boxes=[])
+```
+
+### 3. Receipt Extractor (`ocr_extractor.py`)
+
+```python
+"""Extract structured fields from OCR text."""
+import re
+from datetime import date, datetime
+from decimal import Decimal, InvalidOperation
+from typing import Optional, Tuple
+from dataclasses import dataclass
+
+
+@dataclass
+class ExtractionResult:
+    """Structured extraction result."""
+    receipt_type: str = 'bon_fiscal'
+    receipt_number: Optional[str] = None
+    receipt_series: Optional[str] = None
+    receipt_date: Optional[date] = None
+    amount: Optional[Decimal] = None
+    partner_name: Optional[str] = None
+    cui: Optional[str] = None
+    description: Optional[str] = None
+
+    confidence_amount: float = 0.0
+    confidence_date: float = 0.0
+    confidence_vendor: float = 0.0
+    raw_text: str = ""
+
+    @property
+    def overall_confidence(self) -> float:
+        weights = {'amount': 0.4, 'date': 0.3, 'vendor': 0.3}
+        return round(
+            self.confidence_amount * weights['amount'] +
+            self.confidence_date * weights['date'] +
+            self.confidence_vendor * weights['vendor'], 2
+        )
+
+
+class ReceiptExtractor:
+    """Extract receipt fields using pattern matching."""
+
+    TOTAL_PATTERNS = [
+        (r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
+        (r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
+        (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
+        (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
+    ]
+
+    DATE_PATTERNS = [
+        (r'DATA\s*:?\s*(\d{2}[./]\d{2}[./]\d{4})', 0.95),
+        (r'(\d{2}[./]\d{2}[./]\d{4})\s+\d{2}:\d{2}', 0.90),
+        (r'(\d{2}[./]\d{2}[./]\d{4})', 0.80),
+    ]
+
+    NUMBER_PATTERNS = [
+        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
+        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
+        (r'NR\.?\s*:?\s*(\d{4,})', 0.70),
+    ]
+
+    CUI_PATTERNS = [
+        (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
+        (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
+    ]
+
+    def extract(self, text: str) -> ExtractionResult:
+        result = ExtractionResult()
+        text_upper = text.upper()
+
+        result.amount, result.confidence_amount = self._extract_amount(text_upper)
+        result.receipt_date, result.confidence_date = self._extract_date(text_upper)
+        result.receipt_number, _ = self._extract_number(text_upper)
+        result.partner_name, result.confidence_vendor = self._extract_vendor(text)
+        result.cui, _ = self._extract_cui(text_upper)
+
+        return result
+
+    def _extract_amount(self, text: str) -> Tuple[Optional[Decimal], float]:
+        for pattern, confidence in self.TOTAL_PATTERNS:
+            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
+            if match:
+                try:
+                    amount_str = re.sub(r'[^\d.,]', '', match.group(1))
+                    amount_str = amount_str.replace(',', '.')
+                    parts = amount_str.split('.')
+                    if len(parts) > 2:
+                        amount_str = ''.join(parts[:-1]) + '.' + parts[-1]
+                    amount = Decimal(amount_str)
+                    if amount > 0:
+                        return amount, confidence
+                except (InvalidOperation, ValueError):
+                    continue
+        return None, 0.0
+
+    def _extract_date(self, text: str) -> Tuple[Optional[date], float]:
+        for pattern, confidence in self.DATE_PATTERNS:
+            match = re.search(pattern, text)
+            if match:
+                try:
+                    date_str = match.group(1).replace('/', '.')
+                    parsed = datetime.strptime(date_str, '%d.%m.%Y').date()
+                    today = date.today()
+                    if parsed <= today and parsed.year >= 2020:
+                        return parsed, confidence
+                except ValueError:
+                    continue
+        return None, 0.0
+
+    def _extract_number(self, text: str) -> Tuple[Optional[str], float]:
+        for pattern, confidence in self.NUMBER_PATTERNS:
+            match = re.search(pattern, text, re.IGNORECASE)
+            if match:
+                return match.group(1), confidence
+        return None, 0.0
+
+    def _extract_vendor(self, text: str) -> Tuple[Optional[str], float]:
+        lines = text.split('\n')
+        skip_keywords = ['BON', 'FISCAL', 'TOTAL', 'DATA', 'NR', 'ORA']
+
+        for i, line in enumerate(lines[:5]):
+            line = line.strip()
+            if not line or re.match(r'^[\d.,\s]+$', line):
+                continue
+            if any(kw in line.upper() for kw in skip_keywords):
+                continue
+            vendor = re.sub(r'[^\w\s.,&-]', '', line).strip()
+            if len(vendor) >= 3:
+                return vendor, 0.7 - (i * 0.1)
+        return None, 0.0
+
+    def _extract_cui(self, text: str) -> Tuple[Optional[str], float]:
+        for pattern, confidence in self.CUI_PATTERNS:
+            match = re.search(pattern, text, re.IGNORECASE)
+            if match:
+                cui = match.group(1)
+                if 6 <= len(cui) <= 10:
+                    return cui, confidence
+        return None, 0.0
+```
+
+### 4. OCR Service (`ocr_service.py`)
+
+```python
+"""Main OCR service coordinating preprocessing, recognition, and extraction."""
+from typing import Optional, Tuple
+from pathlib import Path
+import asyncio
+from concurrent.futures import ThreadPoolExecutor
+
+from app.services.ocr_engine import OCREngine
+from app.services.ocr_extractor import ReceiptExtractor, ExtractionResult
+from app.services.image_preprocessor import ImagePreprocessor
+
+
+class OCRService:
+    """Service for OCR processing of receipt images."""
+
+    _executor = ThreadPoolExecutor(max_workers=2)
+
+    def __init__(self):
+        self.preprocessor = ImagePreprocessor()
+        self.ocr_engine = OCREngine()
+        self.extractor = ReceiptExtractor()
+
+    async def process_image(
+        self,
+        image_path: Path,
+        mime_type: str
+    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
+        """Process receipt image and extract structured data."""
+        try:
+            result = await asyncio.get_event_loop().run_in_executor(
+                self._executor,
+                self._process_sync,
+                image_path,
+                mime_type
+            )
+            return result
+        except Exception as e:
+            return False, f"OCR processing failed: {str(e)}", None
+
+    def _process_sync(
+        self,
+        image_path: Path,
+        mime_type: str
+    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
+        """Synchronous processing (runs in thread pool)."""
+
+        # Handle PDF
+        if mime_type == 'application/pdf':
+            images = self.preprocessor.pdf_to_images(image_path)
+            if not images:
+                return False, "Failed to extract images from PDF", None
+            image = images[0]  # First page only
+        else:
+            image = self.preprocessor.load_image(image_path)
+
+        # Preprocess
+        processed = self.preprocessor.preprocess(image)
+
+        # OCR
+        ocr_result = self.ocr_engine.recognize(processed)
+        if not ocr_result.text:
+            return False, "No text detected in image", None
+
+        # Extract fields
+        extraction = self.extractor.extract(ocr_result.text)
+        extraction.raw_text = ocr_result.text
+
+        # Detect receipt type
+        text_upper = ocr_result.text.upper()
+        if 'CHITANTA' in text_upper or 'CHITANȚĂ' in text_upper:
+            extraction.receipt_type = 'chitanta'
+        else:
+            extraction.receipt_type = 'bon_fiscal'
+
+        return True, "OCR processing successful", extraction
+```
+
+### 5. Schemas (`schemas/ocr.py`)
+
+```python
+"""Pydantic schemas for OCR API."""
+from datetime import date
+from decimal import Decimal
+from typing import Optional
+from pydantic import BaseModel, Field
+
+
+class ExtractionData(BaseModel):
+    """Extracted receipt data."""
+    receipt_type: str = Field(default='bon_fiscal')
+    receipt_number: Optional[str] = None
+    receipt_series: Optional[str] = None
+    receipt_date: Optional[date] = None
+    amount: Optional[Decimal] = None
+    partner_name: Optional[str] = None
+    cui: Optional[str] = None
+    description: Optional[str] = None
+
+    confidence_amount: float = Field(default=0.0, ge=0, le=1)
+    confidence_date: float = Field(default=0.0, ge=0, le=1)
+    confidence_vendor: float = Field(default=0.0, ge=0, le=1)
+    overall_confidence: float = Field(default=0.0, ge=0, le=1)
+    raw_text: str = Field(default="")
+
+
+class OCRResponse(BaseModel):
+    """OCR API response."""
+    success: bool
+    message: str
+    data: Optional[ExtractionData] = None
+```
+
+### 6. Router (`routers/ocr.py`)
+
+```python
+"""OCR API endpoints."""
+from pathlib import Path
+import tempfile
+import os
+
+from fastapi import APIRouter, HTTPException, UploadFile, File, Depends
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from app.db.database import get_session
+from app.db.crud.attachment import AttachmentCRUD
+from app.services.ocr_service import OCRService
+from app.schemas.ocr import OCRResponse
+
+router = APIRouter()
+ocr_service = OCRService()
+
+
+@router.post("/extract", response_model=OCRResponse)
+async def extract_from_image(file: UploadFile = File(...)):
+    """Extract receipt data from uploaded image."""
+    allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
+    if file.content_type not in allowed_types:
+        raise HTTPException(400, f"File type not supported: {file.content_type}")
+
+    suffix = Path(file.filename).suffix if file.filename else '.jpg'
+    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
+        content = await file.read()
+        tmp.write(content)
+        tmp_path = Path(tmp.name)
+
+    try:
+        success, message, result = await ocr_service.process_image(
+            tmp_path, file.content_type
+        )
+        if not success:
+            raise HTTPException(422, message)
+
+        return OCRResponse(success=True, message=message, data=result)
+    finally:
+        os.unlink(tmp_path)
+
+
+@router.post("/extract-attachment/{attachment_id}", response_model=OCRResponse)
+async def extract_from_attachment(
+    attachment_id: int,
+    session: AsyncSession = Depends(get_session),
+):
+    """Extract receipt data from existing attachment."""
+    attachment = await AttachmentCRUD.get_by_id(session, attachment_id)
+    if not attachment:
+        raise HTTPException(404, "Attachment not found")
+
+    file_path = AttachmentCRUD.get_file_path(attachment)
+    if not file_path.exists():
+        raise HTTPException(404, "File not found on disk")
+
+    success, message, result = await ocr_service.process_image(
+        file_path, attachment.mime_type
+    )
+    if not success:
+        raise HTTPException(422, message)
+
+    return OCRResponse(success=True, message=message, data=result)
+```
+
+---
+
+## Dependențe
+
+### Python (`requirements.txt` - adaugă)
+```
+# OCR Dependencies
+paddleocr>=2.7.0
+paddlepaddle>=2.5.0
+opencv-python>=4.8.0
+pytesseract>=0.3.10
+pdf2image>=1.16.0
+```
+
+### Sistem (Linux/Docker)
+```bash
+apt-get install -y \
+    tesseract-ocr \
+    tesseract-ocr-ron \
+    tesseract-ocr-eng \
+    poppler-utils \
+    libgl1-mesa-glx \
+    libglib2.0-0
+```
+
+---
+
+## User Flow
+
+```
+1. User deschide "Bon Fiscal Nou"
+2. User trage/selectează poza bonului în OCRUploadZone
+3. [Spinner 2-3 sec] "Se procesează imaginea..."
+4. Apare OCRPreview cu date extrase + confidence indicators
+5. User click "Aplică datele" sau corectează manual
+6. Formularul se completează automat
+7. User selectează tip cheltuială, casa de marcat
+8. User salvează draft sau trimite pentru aprobare
+```
+
+---
+
+## Pași Implementare
+
+### Pasul 1: Dependențe și setup
+- [ ] Adaugă dependențe în `requirements.txt`
+- [ ] Instalează pachete sistem (tesseract, poppler)
+- [ ] Testează import PaddleOCR
+
+### Pasul 2: Backend services
+- [ ] Creează `image_preprocessor.py`
+- [ ] Creează `ocr_engine.py`
+- [ ] Creează `ocr_extractor.py`
+- [ ] Creează `ocr_service.py`
+- [ ] Creează `schemas/ocr.py`
+
+### Pasul 3: API endpoint
+- [ ] Creează `routers/ocr.py`
+- [ ] Include router în `main.py`
+- [ ] Testează endpoint
+
+### Pasul 4: Frontend components
+- [ ] Creează `OCRUploadZone.vue`
+- [ ] Creează `OCRPreview.vue`
+- [ ] Creează `OCRConfidenceIndicator.vue`
+
+### Pasul 5: Integrare
+- [ ] Modifică `ReceiptCreateView.vue`
+- [ ] Adaugă auto-fill din OCR result
+- [ ] Adaugă feedback vizual
+
+### Pasul 6: Testing
+- [ ] Testează pe sample bonuri românești
+- [ ] Ajustează regex patterns
+- [ ] Optimizează preprocessing
+
+---
+
+## Referințe Fișiere Existente
+
+- `data-entry-app/backend/app/services/receipt_service.py` - Pattern servicii
+- `data-entry-app/backend/app/db/crud/attachment.py` - File handling
+- `data-entry-app/backend/app/schemas/receipt.py` - Schema patterns
+- `data-entry-app/backend/app/db/models/receipt.py` - Receipt model
+- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - View de modificat
+- `data-entry-app/CLAUDE.md` - Instrucțiuni specifice data-entry
--- a/docs/data-entry/ARCHITECTURE.md
+++ b/docs/data-entry/ARCHITECTURE.md
@@ -80,13 +80,14 @@ data/uploads/
 │  │   Vue.js     │     │   FastAPI    │     │   (staging)  │   │
 │  │   :3010      │     │   :8003      │     │              │   │
 │  └──────────────┘     └──────┬───────┘     └──────────────┘   │
-│                              │                                  │
-│                              │ Nomenclatoare                    │
-│                              ▼                                  │
-│                       ┌──────────────┐                         │
-│                       │   Oracle     │                         │
-│                       │ (read-only)  │                         │
-│                       └──────────────┘                         │
+│        │                     │                                  │
+│        │ OCR Upload          │ Nomenclatoare                    │
+│        ▼                     ▼                                  │
+│  ┌──────────────┐     ┌──────────────┐                         │
+│  │  OCR Service │     │   Oracle     │                         │
+│  │  PaddleOCR   │     │ (read-only)  │                         │
+│  │  +Tesseract  │     └──────────────┘                         │
+│  └──────────────┘                                               │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
 JWT_ALGORITHM=HS256
 ```

+## OCR Processing Pipeline
+
+### 5. OCR Architecture
+
+**Alegere**: PaddleOCR (primar) + Tesseract (fallback), procesare 100% locala
+
+**Motivatie**:
+- Zero costuri externe (fara API-uri cloud)
+- Procesare on-premise (date sensibile raman locale)
+- PaddleOCR: acuratete ridicata, CPU-friendly
+- Tesseract: suport excelent pentru diacritice romanesti
+
+**Stack OCR**:
+```
+┌─────────────────────────────────────────────────────┐
+│                   OCR Pipeline                       │
+├─────────────────────────────────────────────────────┤
+│                                                      │
+│  Image Upload → ImagePreprocessor → OCREngine        │
+│       │              │                  │            │
+│       │              ▼                  ▼            │
+│       │         ┌─────────┐      ┌──────────────┐   │
+│       │         │ OpenCV  │      │ PaddleOCR    │   │
+│       │         │ Pipeline│      │ (primary)    │   │
+│       │         └─────────┘      └──────┬───────┘   │
+│       │              │                  │            │
+│       │              │           fallback│            │
+│       │              │                  ▼            │
+│       │              │           ┌──────────────┐   │
+│       │              │           │ Tesseract    │   │
+│       │              │           │ (ron+eng)    │   │
+│       │              │           └──────────────┘   │
+│       │              │                  │            │
+│       ▼              ▼                  ▼            │
+│  ┌──────────────────────────────────────────────┐   │
+│  │           ReceiptExtractor (Regex)           │   │
+│  │  - Amount patterns (TOTAL, DE PLATA)         │   │
+│  │  - Date patterns (DD.MM.YYYY)                │   │
+│  │  - CUI patterns (C.U.I., C.I.F.)             │   │
+│  │  - Vendor extraction (first lines)           │   │
+│  └──────────────────────────────────────────────┘   │
+│                        │                             │
+│                        ▼                             │
+│              ExtractionResult + Confidence           │
+│                                                      │
+└─────────────────────────────────────────────────────┘
+```
+
+### Image Preprocessing Pipeline
+
+```python
+def preprocess(image):
+    1. Convert to grayscale
+    2. Resize if width < 1000px (upscale for better OCR)
+    3. Deskew using Hough lines (straighten rotated text)
+    4. Denoise (Non-local means denoising)
+    5. Adaptive thresholding (binarization)
+    6. Morphological close (connect broken characters)
+    return processed_image
+```
+
+### Extraction Patterns (Romanian Receipts)
+
+| Pattern Type | Regex Examples | Confidence |
+|--------------|----------------|------------|
+| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
+| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
+| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
+| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
+| Vendor | First 5 non-keyword lines | 0.70 |
+
+### OCR API Endpoints
+
+```
+GET  /api/ocr/status                      # Check OCR availability
+POST /api/ocr/extract                     # Extract from uploaded image
+POST /api/ocr/extract-attachment/{id}     # Re-process existing attachment
+```
+
+### System Dependencies
+
+```bash
+# Ubuntu/Debian
+apt-get install -y \
+    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
+    poppler-utils libgl1-mesa-glx libglib2.0-0
+```
+
 ## Testing Strategy

 ### Unit Tests
 - CRUD operations
 - Workflow transitions
 - Entry generation logic
+- OCR extraction patterns

 ### Integration Tests
 - API endpoints
 - File upload/download
 - Oracle nomenclature fetch
+- OCR endpoint with sample receipts

 ### E2E Tests
 - Complete workflow: create → submit → approve
 - File upload cu preview
+- OCR extraction → form auto-fill
--- a/docs/data-entry/REQUIREMENTS.md
+++ b/docs/data-entry/REQUIREMENTS.md
@@ -3,6 +3,7 @@
 ## Obiectiv

 Sistem de introducere bonuri fiscale cu:
+- **OCR automat** pentru extragerea datelor din poze bonuri (100% local, fara costuri)
 - **Upload poze** bonuri de la utilizatori
 - **Generare automata** note contabile (staging area)
 - **Aprobare de contabil** inainte de finalizare
@@ -13,8 +14,10 @@ Sistem de introducere bonuri fiscale cu:

 ### 1. Gestiune Bonuri Fiscale

-#### 1.1 Creare Bon
- Utilizatorul poate uploada o poza a bonului fiscal
+#### 1.1 Creare Bon cu OCR
+- Utilizatorul uploadeaza poza bonului fiscal
+- **OCR extrage automat**: suma, data, furnizor, CUI, numar bon
+- Utilizatorul verifica si corecteaza datele extrase
 - Campuri obligatorii: tip document, directie, data, suma, furnizor, casa/banca
 - Campuri optionale: numar bon, serie, descriere
 - Tipuri document: Bon Fiscal, Chitanta
@@ -145,11 +148,71 @@ GET    /api/receipts/cash-registers      # Case/Banci
 GET    /api/receipts/expense-types       # Tipuri cheltuieli
 ```

+### OCR
+```
+GET    /api/ocr/status                   # Verifica disponibilitate OCR
+POST   /api/ocr/extract                  # Extrage date din imagine uploadata
+POST   /api/ocr/extract-attachment/{id}  # Re-proceseaza atasament existent
+```
+
+## OCR - Specificatii Tehnice
+
+### Cerinte OCR
+- **100% local** - fara costuri externe, fara API-uri cloud
+- **Full-auto** - completeaza formularul automat
+- **Input**: doar imagini (JPG/PNG/PDF)
+- **On-premise** - datele sensibile raman locale
+
+### Campuri Extrase Automat
+
+| Camp | Tip | Acuratete Estimata |
+|------|-----|-------------------|
+| Suma (TOTAL) | Decimal | 90-95% |
+| Data bon | Date | 85-90% |
+| Numar bon | String | 80-85% |
+| Furnizor | String | 70-80% |
+| CUI | String | 85-90% |
+| Tip document | Enum | 95%+ |
+
+### Stack Tehnic OCR
+
+| Component | Solutie | Justificare |
+|-----------|---------|-------------|
+| **OCR Engine** | PaddleOCR (primar) | 85-92% acuratete, pip install, CPU-friendly |
+| **Fallback OCR** | Tesseract + ron | Suport excelent diacritice romanesti |
+| **Extractie** | Regex/rules-based | Zero dependente extra, rapid (<100ms) |
+| **Preprocessing** | OpenCV | Deskew, binarizare, denoise |
+| **PDF → Image** | pdf2image + Poppler | Standard, fiabil |
+
+### Dependente Sistem (Linux)
+
+```bash
+apt-get install -y \
+    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
+    poppler-utils libgl1-mesa-glx libglib2.0-0
+```
+
+### User Flow OCR
+
+```
+1. User deschide "Bon Fiscal Nou"
+2. User trage/selecteaza poza bonului
+3. Click "Proceseaza cu OCR"
+4. [Spinner 2-3 sec] "Se proceseaza imaginea..."
+5. Apare preview cu date extrase + indicatori incredere
+6. User click "Aplica datele" sau corecteaza manual
+7. Formularul se completeaza automat
+8. User selecteaza tip cheltuiala, casa de marcat
+9. User salveaza draft sau trimite pentru aprobare
+```
+
 ## Criterii de Succes (Faza 1)

- [ ] Utilizator poate uploada poza bon + date de baza
- [ ] Sistem genereaza automat note contabile
- [ ] Contabil poate vedea, edita si aproba note
- [ ] Bonurile aprobate sunt vizibile in lista
- [ ] Migrarile Alembic functioneaza corect
- [ ] Poze bonuri se salveaza si se afiseaza corect
+- [x] Utilizator poate uploada poza bon + date de baza
+- [x] **OCR extrage automat date din poza bonului**
+- [x] **Indicatori de incredere pentru date extrase**
+- [x] Sistem genereaza automat note contabile
+- [x] Contabil poate vedea, edita si aproba note
+- [x] Bonurile aprobate sunt vizibile in lista
+- [x] Migrarile Alembic functioneaza corect
+- [x] Poze bonuri se salveaza si se afiseaza corect