
Marius Mutu 41ae97180e feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name,
date, total amount, and VAT from uploaded receipt images/PDFs,
reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-12 11:48:29 +02:00


OCR Implementation Plan - Data Entry App

Context Handover Document
Created: 2025-12-11
Branch: feature/data-entry-receipts
Status: Ready for implementation

Executive Summary

Fully local OCR (no external costs) for automatically extracting data from Romanian fiscal receipts (bonuri fiscale) and payment receipts (chitanțe). The solution uses PaddleOCR plus regex extraction, with fully automatic form completion.

User requirements:

Open-source and local, no external costs
Full-auto: fills in the form automatically
Input: images only (JPG/PNG/PDF)
On-premise processing

Recommended Technical Stack

Component	Solution	Rationale
OCR Engine	PaddleOCR (primary)	85-92% accuracy, simple pip install, CPU-friendly
Fallback OCR	Tesseract + ron	Excellent support for Romanian diacritics
Extraction	Regex/rules-based	No extra dependencies, fast (<100ms), deterministic
Preprocessing	OpenCV	Deskew, binarization, denoising; essential for thermal receipts
PDF → Image	pdf2image + Poppler	Standard, reliable

Files to Create

Backend (New)

data-entry-app/backend/app/
├── services/
│   ├── ocr_service.py          # OCR orchestration (async)
│   ├── ocr_engine.py           # Wrapper PaddleOCR + Tesseract
│   ├── ocr_extractor.py        # Regex patterns for Romanian receipts
│   └── image_preprocessor.py   # OpenCV pipeline
├── schemas/
│   └── ocr.py                  # ExtractionData, OCRResponse
└── routers/
    └── ocr.py                  # POST /api/ocr/extract

Frontend (New)

data-entry-app/frontend/src/components/ocr/
├── OCRUploadZone.vue           # Drag-drop + trigger OCR
├── OCRPreview.vue              # Preview of extracted data
└── OCRConfidenceIndicator.vue  # Visual confidence indicator

Changes to Existing Files

data-entry-app/backend/requirements.txt - add OCR dependencies
data-entry-app/backend/app/main.py - include the OCR router
data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue - OCR integration

Fields to Extract (from the Receipt model)

Target fields for OCR extraction (see data-entry-app/backend/app/db/models/receipt.py):

Field	Type	Estimated accuracy
`receipt_type`	Enum: BON_FISCAL, CHITANTA	95%+
`receipt_number`	String (max 50)	80-85%
`receipt_date`	Date	85-90%
`amount`	Decimal(2)	90-95%
`partner_name`	String (max 200)	70-80%
`cui`	String (fiscal code)	85-90%

API Design

`POST /api/ocr/extract`

Input: multipart/form-data with one file (JPG/PNG/PDF, max 10MB)

Output:

{
  "success": true,
  "message": "OCR processing successful",
  "data": {
    "receipt_type": "bon_fiscal",
    "receipt_number": "12345",
    "receipt_series": null,
    "receipt_date": "2024-01-15",
    "amount": 125.50,
    "partner_name": "MEGA IMAGE SRL",
    "cui": "12345678",
    "description": null,
    "confidence_amount": 0.95,
    "confidence_date": 0.90,
    "confidence_vendor": 0.75,
    "overall_confidence": 0.87,
    "raw_text": "BON FISCAL\nMEGA IMAGE SRL\n..."
  }
}
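
A client could use the per-field confidences in the response above to decide which form fields to highlight for manual review. The following is a minimal sketch using only the standard library; the 0.80 threshold and the `fields_needing_review` helper are assumptions for illustration, not part of the planned API.

```python
import json
from typing import Dict, List

# Hypothetical threshold: fields scoring below it are flagged for manual review.
REVIEW_THRESHOLD = 0.80

# Maps each confidence key in the response to the form field it describes.
CONFIDENCE_FIELDS: Dict[str, str] = {
    "confidence_amount": "amount",
    "confidence_date": "receipt_date",
    "confidence_vendor": "partner_name",
}


def fields_needing_review(response_body: str,
                          threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Return form fields that are missing or scored below the threshold."""
    payload = json.loads(response_body)
    if not payload.get("success") or payload.get("data") is None:
        raise ValueError(payload.get("message", "OCR request failed"))
    data = payload["data"]
    return [
        field
        for key, field in CONFIDENCE_FIELDS.items()
        if data.get(field) is None or data.get(key, 0.0) < threshold
    ]


sample = json.dumps({
    "success": True,
    "message": "OCR processing successful",
    "data": {
        "receipt_date": "2024-01-15",
        "amount": 125.50,
        "partner_name": "MEGA IMAGE SRL",
        "confidence_amount": 0.95,
        "confidence_date": 0.90,
        "confidence_vendor": 0.75,
    },
})
print(fields_needing_review(sample))  # ['partner_name']
```

With the sample values only the vendor name falls below the assumed threshold, so it is the only field flagged.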

`POST /api/ocr/extract-attachment/{attachment_id}`

Re-processes an existing attachment.

Detailed Implementation

1. Image Preprocessor (`image_preprocessor.py`)

"""Image preprocessing for optimal OCR results."""
from pathlib import Path
from typing import List
import numpy as np
import cv2

try:
    import pdf2image
    PDF_AVAILABLE = True
except ImportError:
    PDF_AVAILABLE = False


class ImagePreprocessor:
    """Preprocess receipt images for OCR."""

    def load_image(self, path: Path) -> np.ndarray:
        """Load image from file."""
        image = cv2.imread(str(path))
        if image is None:
            raise ValueError(f"Could not load image: {path}")
        return image

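    def pdf_to_images(self, path: Path, dpi: int = 300) -> List[np.ndarray]:
        """Convert PDF pages to BGR images (the channel order OpenCV expects)."""
        if not PDF_AVAILABLE:
            raise RuntimeError("pdf2image not available")
        pages = pdf2image.convert_from_path(str(path), dpi=dpi)
        # pdf2image returns PIL images in RGB order; convert so that
        # preprocess() can apply COLOR_BGR2GRAY with the correct weights.
        return [cv2.cvtColor(np.array(page.convert("RGB")), cv2.COLOR_RGB2BGR)
                for page in pages]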

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Apply preprocessing pipeline for thermal receipt images.

        Pipeline:
        1. Convert to grayscale
        2. Resize if too small (min 1000px width)
        3. Deskew (straighten rotated text)
        4. Denoise (Non-local means)
        5. Adaptive thresholding (binarization)
        6. Morphological close (connect broken chars)
        """
        # 1. Grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        # 2. Resize if too small
        height, width = gray.shape
        if width < 1000:
            scale = 1000 / width
            gray = cv2.resize(gray, None, fx=scale, fy=scale,
                            interpolation=cv2.INTER_CUBIC)

        # 3. Deskew
        gray = self._deskew(gray)

        # 4. Denoise
        denoised = cv2.fastNlMeansDenoising(gray, h=10,
                                            templateWindowSize=7,
                                            searchWindowSize=21)

        # 5. Adaptive thresholding
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blockSize=15, C=8
        )

        # 6. Morphological close
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
        result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)

        return result

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct image rotation/skew using Hough lines."""
        edges = cv2.Canny(image, 50, 150, apertureSize=3)
        lines = cv2.HoughLinesP(edges, 1, np.pi/180,
                               threshold=100, minLineLength=100, maxLineGap=10)

        if lines is None:
            return image

        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
            if abs(angle) < 45:
                angles.append(angle)

        if not angles:
            return image

        median_angle = np.median(angles)
        if abs(median_angle) < 0.5:
            return image

        h, w = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
        return cv2.warpAffine(image, M, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)
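
The angle estimate inside `_deskew` can be checked without OpenCV. The sketch below reproduces only the median-angle logic on hand-written line segments; `estimate_skew` and the sample segments are illustrative assumptions, and the real method gets its segments from `cv2.HoughLinesP`.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def estimate_skew(segments: List[Segment], max_angle: float = 45.0,
                  min_correction: float = 0.5) -> Optional[float]:
    """Return the median angle (degrees) of near-horizontal segments.

    Returns None when no usable segment exists or the skew is too small
    to be worth correcting, mirroring the early returns in _deskew.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) < max_angle:  # ignore steep edges such as paper borders
            angles.append(angle)
    if not angles:
        return None
    skew = median(angles)
    return None if abs(skew) < min_correction else skew


# Three text baselines tilted by about 2 degrees plus one vertical border line.
segments = [
    (0, 0, 100, 3.5),
    (0, 50, 200, 57),
    (0, 90, 100, 93.5),
    (10, 0, 10, 300),
]
print(estimate_skew(segments))
```

Using the median rather than the mean keeps a few stray segments (logos, table rules) from dominating the estimate.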

2. OCR Engine (`ocr_engine.py`)

"""OCR engine wrapper for PaddleOCR and Tesseract."""
from dataclasses import dataclass
from typing import List
import numpy as np

try:
    from paddleocr import PaddleOCR
    PADDLE_AVAILABLE = True
except ImportError:
    PADDLE_AVAILABLE = False

try:
    import pytesseract
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False


@dataclass
class OCRResult:
    """Raw OCR result."""
    text: str
    confidence: float
    boxes: List[dict]


class OCREngine:
    """Unified OCR engine with fallback support."""

    def __init__(self):
        self._paddle = None
        self._init_engines()

    def _init_engines(self):
        if PADDLE_AVAILABLE:
            self._paddle = PaddleOCR(
                use_angle_cls=True,
                lang='en',  # Better for mixed text
                use_gpu=False,
                show_log=False,
                det_db_thresh=0.3,
                det_db_box_thresh=0.5,
            )

    def recognize(self, image: np.ndarray) -> OCRResult:
        """Perform OCR on preprocessed image."""
        if PADDLE_AVAILABLE and self._paddle:
            return self._paddle_recognize(image)
        elif TESSERACT_AVAILABLE:
            return self._tesseract_recognize(image)
        else:
            raise RuntimeError("No OCR engine available")

    def _paddle_recognize(self, image: np.ndarray) -> OCRResult:
        result = self._paddle.ocr(image, cls=True)

        if not result or not result[0]:
            return OCRResult(text="", confidence=0.0, boxes=[])

        lines = []
        total_conf = 0.0
        boxes = []

        for line in result[0]:
            box, (text, conf) = line
            lines.append(text)
            total_conf += conf
            boxes.append({'text': text, 'confidence': conf, 'box': box})

        avg_conf = total_conf / len(result[0]) if result[0] else 0.0
        return OCRResult(text='\n'.join(lines), confidence=avg_conf, boxes=boxes)

    def _tesseract_recognize(self, image: np.ndarray) -> OCRResult:
        config = '--psm 6 -l ron+eng'
        text = pytesseract.image_to_string(image, config=config)
        data = pytesseract.image_to_data(image, config=config,
                                         output_type=pytesseract.Output.DICT)
        # conf may arrive as int, float, or a string such as '96.5' depending
        # on the Tesseract/pytesseract version; -1 marks non-word rows.
        confidences = [float(c) for c in data['conf'] if float(c) > 0]
        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
        return OCRResult(text=text, confidence=avg_conf, boxes=[])
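
As written, `recognize` uses Tesseract only when PaddleOCR is not installed. If a confidence-gated fallback is wanted instead, it could look roughly like the sketch below. The `recognize_with_fallback` helper, the 0.6 threshold, and the stub engines are all hypothetical; the `OCRResult` here is a trimmed copy so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class OCRResult:
    text: str
    confidence: float


# An engine is any callable that takes an image and returns an OCRResult.
Engine = Callable[[object], OCRResult]


def recognize_with_fallback(image: object, engines: Sequence[Engine],
                            min_confidence: float = 0.6) -> OCRResult:
    """Try engines in order; return the first confident result, else the best one."""
    best: Optional[OCRResult] = None
    errors: List[str] = []
    for engine in engines:
        try:
            result = engine(image)
        except Exception as exc:  # one engine crashing should not abort the chain
            errors.append(f"{getattr(engine, '__name__', engine)}: {exc}")
            continue
        if result.text and result.confidence >= min_confidence:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    if best is None:
        raise RuntimeError("All OCR engines failed: " + "; ".join(errors))
    return best


# Stub engines standing in for the PaddleOCR and Tesseract wrappers.
def weak_engine(image: object) -> OCRResult:
    return OCRResult(text="T0TAL 12S.50", confidence=0.41)


def strong_engine(image: object) -> OCRResult:
    return OCRResult(text="TOTAL 125.50", confidence=0.93)


print(recognize_with_fallback(None, [weak_engine, strong_engine]).text)  # TOTAL 125.50
```

The trade-off is latency: a low-confidence first pass costs a second full OCR run, so the threshold would need tuning on real receipts.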

3. Receipt Extractor (`ocr_extractor.py`)

"""Extract structured fields from OCR text."""
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Structured extraction result."""
    receipt_type: str = 'bon_fiscal'
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None

    confidence_amount: float = 0.0
    confidence_date: float = 0.0
    confidence_vendor: float = 0.0
    raw_text: str = ""

    @property
    def overall_confidence(self) -> float:
        weights = {'amount': 0.4, 'date': 0.3, 'vendor': 0.3}
        return round(
            self.confidence_amount * weights['amount'] +
            self.confidence_date * weights['date'] +
            self.confidence_vendor * weights['vendor'], 2
        )


class ReceiptExtractor:
    """Extract receipt fields using pattern matching."""

    # \b keeps 'SUBTOTAL' from matching the TOTAL patterns.
    TOTAL_PATTERNS = [
        (r'\bTOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
        (r'\bTOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
        (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
        (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
    ]

    DATE_PATTERNS = [
        (r'DATA\s*:?\s*(\d{2}[./]\d{2}[./]\d{4})', 0.95),
        (r'(\d{2}[./]\d{2}[./]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[./]\d{2}[./]\d{4})', 0.80),
    ]

    NUMBER_PATTERNS = [
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*:?\s*(\d{4,})', 0.70),
    ]

    CUI_PATTERNS = [
        (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
        (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
    ]

    def extract(self, text: str) -> ExtractionResult:
        result = ExtractionResult()
        text_upper = text.upper()

        result.amount, result.confidence_amount = self._extract_amount(text_upper)
        result.receipt_date, result.confidence_date = self._extract_date(text_upper)
        result.receipt_number, _ = self._extract_number(text_upper)
        result.partner_name, result.confidence_vendor = self._extract_vendor(text)
        result.cui, _ = self._extract_cui(text_upper)

        return result

    def _extract_amount(self, text: str) -> Tuple[Optional[Decimal], float]:
        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                try:
                    amount_str = re.sub(r'[^\d.,]', '', match.group(1))
                    amount_str = amount_str.replace(',', '.')
                    parts = amount_str.split('.')
                    if len(parts) > 2:
                        amount_str = ''.join(parts[:-1]) + '.' + parts[-1]
                    amount = Decimal(amount_str)
                    if amount > 0:
                        return amount, confidence
                except (InvalidOperation, ValueError):
                    continue
        return None, 0.0

    def _extract_date(self, text: str) -> Tuple[Optional[date], float]:
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text)
            if match:
                try:
                    date_str = match.group(1).replace('/', '.')
                    parsed = datetime.strptime(date_str, '%d.%m.%Y').date()
                    today = date.today()
                    if parsed <= today and parsed.year >= 2020:
                        return parsed, confidence
                except ValueError:
                    continue
        return None, 0.0

    def _extract_number(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.NUMBER_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                return match.group(1), confidence
        return None, 0.0

    def _extract_vendor(self, text: str) -> Tuple[Optional[str], float]:
        lines = text.split('\n')
        skip_keywords = {'BON', 'FISCAL', 'TOTAL', 'DATA', 'NR', 'ORA'}

        for i, line in enumerate(lines[:5]):
            line = line.strip()
            if not line or re.match(r'^[\d.,\s]+$', line):
                continue
            # Compare whole words so that names containing a keyword as a
            # substring (e.g. 'ORA' inside 'CORA') are not skipped.
            words = set(re.findall(r'\w+', line.upper()))
            if words & skip_keywords:
                continue
            vendor = re.sub(r'[^\w\s.,&-]', '', line).strip()
            if len(vendor) >= 3:
                return vendor, 0.7 - (i * 0.1)
        return None, 0.0

    def _extract_cui(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.CUI_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                cui = match.group(1)
                if 6 <= len(cui) <= 10:
                    return cui, confidence
        return None, 0.0
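
The amount normalization in `_extract_amount` is easiest to follow on concrete inputs. This standalone sketch repeats the same steps (strip non-numeric characters, turn commas into dots, treat only the last separator as the decimal point); `normalize_amount` is a name chosen here for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Convert an OCR amount string to Decimal; the last separator is the decimal point."""
    cleaned = re.sub(r"[^\d.,]", "", raw).replace(",", ".")
    parts = cleaned.split(".")
    if len(parts) > 2:
        # Earlier separators are thousands groupings: drop them.
        cleaned = "".join(parts[:-1]) + "." + parts[-1]
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    return amount if amount > 0 else None


for raw in ["125,50", "1.234,56", "1 234.56", "  ", "0,00"]:
    print(repr(raw), "->", normalize_amount(raw))
```

One ambiguity remains: a lone `1.234` is read as 1.234 rather than 1234, because a single separator is always taken as the decimal point.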

4. OCR Service (`ocr_service.py`)

"""Main OCR service coordinating preprocessing, recognition, and extraction."""
from typing import Optional, Tuple
from pathlib import Path
import asyncio
from concurrent.futures import ThreadPoolExecutor

from app.services.ocr_engine import OCREngine
from app.services.ocr_extractor import ReceiptExtractor, ExtractionResult
from app.services.image_preprocessor import ImagePreprocessor


class OCRService:
    """Service for OCR processing of receipt images."""

    _executor = ThreadPoolExecutor(max_workers=2)

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.ocr_engine = OCREngine()
        self.extractor = ReceiptExtractor()

    async def process_image(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Process receipt image and extract structured data."""
        try:
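            # get_running_loop() is the supported call inside a coroutine;
            # get_event_loop() is deprecated in this context.
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(
                self._executor,
                self._process_sync,
                image_path,
                mime_type
            )
            return result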
        except Exception as e:
            return False, f"OCR processing failed: {str(e)}", None

    def _process_sync(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Synchronous processing (runs in thread pool)."""

        # Handle PDF
        if mime_type == 'application/pdf':
            images = self.preprocessor.pdf_to_images(image_path)
            if not images:
                return False, "Failed to extract images from PDF", None
            image = images[0]  # First page only
        else:
            image = self.preprocessor.load_image(image_path)

        # Preprocess
        processed = self.preprocessor.preprocess(image)

        # OCR
        ocr_result = self.ocr_engine.recognize(processed)
        if not ocr_result.text:
            return False, "No text detected in image", None

        # Extract fields
        extraction = self.extractor.extract(ocr_result.text)
        extraction.raw_text = ocr_result.text

        # Detect receipt type
        text_upper = ocr_result.text.upper()
        if 'CHITANTA' in text_upper or 'CHITANȚĂ' in text_upper:
            extraction.receipt_type = 'chitanta'
        else:
            extraction.receipt_type = 'bon_fiscal'

        return True, "OCR processing successful", extraction
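
The service keeps the event loop responsive by pushing the blocking pipeline onto a thread pool. The pattern in isolation looks like the sketch below, where `slow_ocr` is a stand-in for `_process_sync` and the 0.2 s sleep is an arbitrary placeholder for OCR time.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Shared pool, sized like the one in OCRService.
_executor = ThreadPoolExecutor(max_workers=2)


def slow_ocr(name: str) -> str:
    """Stand-in for the blocking OCR pipeline."""
    time.sleep(0.2)
    return f"text from {name}"


async def process(name: str) -> str:
    """Run the blocking call in the pool without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, slow_ocr, name)


async def main() -> list:
    # Both calls run concurrently because the pool has two workers.
    return await asyncio.gather(process("a.jpg"), process("b.jpg"))


results = asyncio.run(main())
print(results)
```

With `max_workers=2`, a third simultaneous request waits in the executor queue until a worker frees up, which bounds CPU and memory use during OCR.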

5. Schemas (`schemas/ocr.py`)

"""Pydantic schemas for OCR API."""
from datetime import date
from decimal import Decimal
from typing import Optional
from pydantic import BaseModel, Field


class ExtractionData(BaseModel):
    """Extracted receipt data."""
    receipt_type: str = Field(default='bon_fiscal')
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None

    confidence_amount: float = Field(default=0.0, ge=0, le=1)
    confidence_date: float = Field(default=0.0, ge=0, le=1)
    confidence_vendor: float = Field(default=0.0, ge=0, le=1)
    overall_confidence: float = Field(default=0.0, ge=0, le=1)
    raw_text: str = Field(default="")


class OCRResponse(BaseModel):
    """OCR API response."""
    success: bool
    message: str
    data: Optional[ExtractionData] = None

6. Router (`routers/ocr.py`)

"""OCR API endpoints."""
from dataclasses import asdict
from pathlib import Path
import tempfile
import os

from fastapi import APIRouter, HTTPException, UploadFile, File, Depends
from sqlalchemy.ext.asyncio import AsyncSession

from app.db.database import get_session
from app.db.crud.attachment import AttachmentCRUD
from app.services.ocr_extractor import ExtractionResult
from app.services.ocr_service import OCRService
from app.schemas.ocr import ExtractionData, OCRResponse

router = APIRouter()
ocr_service = OCRService()

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB limit from the API design


def _to_response(message: str, result: ExtractionResult) -> OCRResponse:
    """Convert the extractor dataclass into the Pydantic response schema."""
    # asdict() skips properties, so overall_confidence is passed explicitly.
    data = ExtractionData(
        **asdict(result), overall_confidence=result.overall_confidence
    )
    return OCRResponse(success=True, message=message, data=data)


@router.post("/extract", response_model=OCRResponse)
async def extract_from_image(file: UploadFile = File(...)):
    """Extract receipt data from uploaded image."""
    allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
    if file.content_type not in allowed_types:
        raise HTTPException(400, f"File type not supported: {file.content_type}")

    content = await file.read()
    if len(content) > MAX_UPLOAD_BYTES:
        raise HTTPException(413, "File too large (max 10MB)")

    # Persist the upload to a temp file because the OCR pipeline reads from disk.
    suffix = Path(file.filename).suffix if file.filename else '.jpg'
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)

    try:
        success, message, result = await ocr_service.process_image(
            tmp_path, file.content_type
        )
        if not success:
            raise HTTPException(422, message)

        return _to_response(message, result)
    finally:
        # Always remove the temp file, including on error paths.
        os.unlink(tmp_path)


@router.post("/extract-attachment/{attachment_id}", response_model=OCRResponse)
async def extract_from_attachment(
    attachment_id: int,
    session: AsyncSession = Depends(get_session),
):
    """Extract receipt data from existing attachment."""
    attachment = await AttachmentCRUD.get_by_id(session, attachment_id)
    if not attachment:
        raise HTTPException(404, "Attachment not found")

    file_path = AttachmentCRUD.get_file_path(attachment)
    if not file_path.exists():
        raise HTTPException(404, "File not found on disk")

    success, message, result = await ocr_service.process_image(
        file_path, attachment.mime_type
    )
    if not success:
        raise HTTPException(422, message)

    return _to_response(message, result)

Dependencies

Python (`requirements.txt` additions)

# OCR Dependencies
paddleocr>=2.7.0,<3.0  # ocr_engine.py uses the 2.x API (use_gpu, show_log, cls)
paddlepaddle>=2.5.0
opencv-python>=4.8.0
pytesseract>=0.3.10
pdf2image>=1.16.0

Sistem (Linux/Docker)

apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1 \
    libglib2.0-0

User Flow

1. User opens "New Fiscal Receipt"
2. User drags or selects the receipt photo in OCRUploadZone
3. [Spinner 2-3 sec] "Processing image..."
4. OCRPreview appears with the extracted data and confidence indicators
5. User clicks "Apply data" or corrects the values manually
6. The form is filled in automatically
7. User selects the expense type and the cash register
8. User saves a draft or submits for approval

Implementation Steps

Step 1: Dependencies and setup

Add the dependencies to requirements.txt
Install the system packages (tesseract, poppler)
Test the PaddleOCR import
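
A quick way to verify Step 1 is a script that reports which Python modules and system binaries are visible. This is a sketch using only the standard library; `check_ocr_dependencies` is a hypothetical helper, and `pdftoppm` is checked because it is the Poppler binary that pdf2image calls.

```python
import importlib.util
import shutil
from typing import Dict

# Import names (not pip package names) for the Python dependencies.
PYTHON_MODULES = ["paddleocr", "cv2", "pytesseract", "pdf2image"]
# Executables expected on PATH after installing the system packages.
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]


def check_ocr_dependencies() -> Dict[str, bool]:
    """Report availability of each dependency without importing heavy modules."""
    status: Dict[str, bool] = {}
    for module in PYTHON_MODULES:
        status[f"python:{module}"] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        status[f"binary:{binary}"] = shutil.which(binary) is not None
    return status


for name, available in check_ocr_dependencies().items():
    print(f"{name:<22} {'ok' if available else 'MISSING'}")
```

`find_spec` only locates a module, so a package that is installed but fails at import time (for example, because of a missing shared library) still needs an explicit import test.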

Step 2: Backend services

Create image_preprocessor.py
Create ocr_engine.py
Create ocr_extractor.py
Create ocr_service.py
Create schemas/ocr.py

Step 3: API endpoint

Create routers/ocr.py
Include the router in main.py
Test the endpoint

Step 4: Frontend components

Create OCRUploadZone.vue
Create OCRPreview.vue
Create OCRConfidenceIndicator.vue

Step 5: Integration

Modify ReceiptCreateView.vue
Add auto-fill from the OCR result
Add visual feedback

Step 6: Testing

Test on sample Romanian receipts
Adjust the regex patterns
Optimize preprocessing
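
When adjusting the regex patterns, a small table of sample texts with expected results makes regressions visible. The sketch below does this for date extraction only. It is a simplified stand-in for `_extract_date` (one pattern, scanning every match with `finditer`), and the sample texts are invented for illustration.

```python
import re
from datetime import date, datetime
from typing import Optional

DATE_RE = re.compile(r"(\d{2}[./]\d{2}[./]\d{4})")


def extract_date(text: str, today: Optional[date] = None) -> Optional[date]:
    """Return the first plausible DD.MM.YYYY or DD/MM/YYYY date in the text."""
    today = today or date.today()
    for match in DATE_RE.finditer(text):
        try:
            parsed = datetime.strptime(
                match.group(1).replace("/", "."), "%d.%m.%Y"
            ).date()
        except ValueError:  # e.g. 31.02.2024 produced by an OCR misread
            continue
        if date(2020, 1, 1) <= parsed <= today:
            return parsed
    return None


# (OCR text, expected date) pairs; extend with text from real receipts.
SAMPLES = [
    ("BON FISCAL\nDATA: 15.01.2024 ORA: 12:30", date(2024, 1, 15)),
    ("15/01/2024 12:30\nTOTAL 10,00", date(2024, 1, 15)),
    ("DATA: 31.02.2024", None),
    ("DATA: 15.01.2999", None),
]

FIXED_TODAY = date(2025, 1, 1)  # pinned so the checks are reproducible
for text, expected in SAMPLES:
    actual = extract_date(text, today=FIXED_TODAY)
    print("PASS" if actual == expected else "FAIL", repr(text), "->", actual)
```

Pinning `today` keeps the future-date check deterministic, which matters once these samples become automated tests.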

Existing File References

data-entry-app/backend/app/services/receipt_service.py - Service pattern
data-entry-app/backend/app/db/crud/attachment.py - File handling
data-entry-app/backend/app/schemas/receipt.py - Schema patterns
data-entry-app/backend/app/db/models/receipt.py - Receipt model
data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue - View to modify
data-entry-app/CLAUDE.md - Instructions specific to data-entry
