feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name,
date, total amount, and VAT from uploaded receipt images/PDFs,
reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions


@@ -0,0 +1,717 @@
# OCR Implementation Plan - Data Entry App
> **Context Handover Document**
> Created: 2025-12-11
> Branch: `feature/data-entry-receipts`
> Status: Ready for implementation
## Executive Summary
Implement 100% local OCR (no external costs) to automatically extract data from Romanian fiscal receipts (bonuri fiscale) and payment receipts (chitanțe). The solution uses PaddleOCR plus regex extraction, with fully automatic form completion.
**User requirements:**
- Open-source and local, no external costs
- Full-auto: fills in the form automatically
- Input: images only (JPG/PNG/PDF)
- On-premise processing
---
## Recommended Technical Stack
| Component | Solution | Rationale |
|-----------|----------|-----------|
| **OCR Engine** | PaddleOCR (primary) | 85-92% estimated accuracy, simple pip install, CPU-friendly |
| **Fallback OCR** | Tesseract + `ron` language pack | Excellent support for Romanian diacritics |
| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100ms), deterministic |
| **Preprocessing** | OpenCV | Deskew, binarization, denoise; essential for thermal receipts |
| **PDF → Image** | pdf2image + Poppler | Standard, reliable |
---
## Files to Create
### Backend (New)
```
data-entry-app/backend/app/
├── services/
│   ├── ocr_service.py          # OCR orchestration (async)
│   ├── ocr_engine.py           # PaddleOCR + Tesseract wrapper
│   ├── ocr_extractor.py        # Regex patterns for Romanian receipts
│   └── image_preprocessor.py   # OpenCV pipeline
├── schemas/
│   └── ocr.py                  # ExtractionData, OCRResponse
└── routers/
    └── ocr.py                  # POST /api/ocr/extract
```
### Frontend (New)
```
data-entry-app/frontend/src/components/ocr/
├── OCRUploadZone.vue           # Drag-and-drop + OCR trigger
├── OCRPreview.vue              # Preview of extracted data
└── OCRConfidenceIndicator.vue  # Visual confidence indicator
```
### Changes to Existing Files
- `data-entry-app/backend/requirements.txt` - add OCR dependencies
- `data-entry-app/backend/app/main.py` - include the OCR router
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - OCR integration
---
## Fields to Extract (from the Receipt model)
Target fields for OCR extraction (see `data-entry-app/backend/app/db/models/receipt.py`):
| Field | Type | Estimated accuracy |
|-------|------|--------------------|
| `receipt_type` | Enum: BON_FISCAL, CHITANTA | 95%+ |
| `receipt_number` | String (max 50) | 80-85% |
| `receipt_date` | Date | 85-90% |
| `amount` | Decimal (2 decimal places) | 90-95% |
| `partner_name` | String (max 200) | 70-80% |
| `cui` | String (fiscal code) | 85-90% |
---
## API Design
### `POST /api/ocr/extract`
**Input**: `multipart/form-data` with a file (JPG/PNG/PDF, max 10MB)
**Output**:
```json
{
  "success": true,
  "message": "OCR processing successful",
  "data": {
    "receipt_type": "bon_fiscal",
    "receipt_number": "12345",
    "receipt_series": null,
    "receipt_date": "2024-01-15",
    "amount": 125.50,
    "partner_name": "MEGA IMAGE SRL",
    "cui": "12345678",
    "description": null,
    "confidence_amount": 0.95,
    "confidence_date": 0.90,
    "confidence_vendor": 0.75,
    "overall_confidence": 0.87,
    "raw_text": "BON FISCAL\nMEGA IMAGE SRL\n..."
  }
}
```
### `POST /api/ocr/extract-attachment/{attachment_id}`
Re-processes an existing attachment.
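A client consuming the response above will probably want to decide which fields to pre-fill silently and which to flag for manual review. The sketch below is one way to do that. The weights are copied from `ExtractionResult.overall_confidence` later in this plan; `REVIEW_THRESHOLD` and both helper names are hypothetical and not part of the plan.

```python
import json
from typing import List

# Weights mirror ExtractionResult.overall_confidence in this plan.
WEIGHTS = {"amount": 0.4, "date": 0.3, "vendor": 0.3}
REVIEW_THRESHOLD = 0.80  # hypothetical cut-off, not defined in the plan


def weighted_confidence(data: dict) -> float:
    """Recompute the weighted confidence from the per-field scores."""
    return sum(data[f"confidence_{name}"] * w for name, w in WEIGHTS.items())


def fields_needing_review(data: dict, threshold: float = REVIEW_THRESHOLD) -> List[str]:
    """Return the scored fields whose confidence is below the threshold."""
    return [name for name in WEIGHTS if data[f"confidence_{name}"] < threshold]


# Trimmed copy of the sample response documented above.
sample = json.loads("""
{
  "success": true,
  "data": {
    "amount": 125.50,
    "receipt_date": "2024-01-15",
    "partner_name": "MEGA IMAGE SRL",
    "confidence_amount": 0.95,
    "confidence_date": 0.90,
    "confidence_vendor": 0.75
  }
}
""")

if __name__ == "__main__":
    print(weighted_confidence(sample["data"]))
    print(fields_needing_review(sample["data"]))
```

For the sample values the exact weighted sum is 0.4 × 0.95 + 0.3 × 0.90 + 0.3 × 0.75 = 0.875, which the documented response shows rounded to two decimals. Only the vendor score falls below the assumed threshold.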
---
## Detailed Implementation
### 1. Image Preprocessor (`image_preprocessor.py`)
```python
"""Image preprocessing for optimal OCR results."""
from pathlib import Path
from typing import List

import numpy as np
import cv2

try:
    import pdf2image
    PDF_AVAILABLE = True
except ImportError:
    PDF_AVAILABLE = False


class ImagePreprocessor:
    """Preprocess receipt images for OCR."""

    def load_image(self, path: Path) -> np.ndarray:
        """Load image from file."""
        image = cv2.imread(str(path))
        if image is None:
            raise ValueError(f"Could not load image: {path}")
        return image

    def pdf_to_images(self, path: Path, dpi: int = 300) -> List[np.ndarray]:
        """Convert PDF to images."""
        if not PDF_AVAILABLE:
            raise RuntimeError("pdf2image not available")
        images = pdf2image.convert_from_path(str(path), dpi=dpi)
        # pdf2image returns RGB PIL images; convert to BGR so the arrays match
        # cv2.imread output and the COLOR_BGR2GRAY conversion in preprocess().
        return [cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR) for img in images]

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Apply preprocessing pipeline for thermal receipt images.

        Pipeline:
        1. Convert to grayscale
        2. Resize if too small (min 1000px width)
        3. Deskew (straighten rotated text)
        4. Denoise (Non-local means)
        5. Adaptive thresholding (binarization)
        6. Morphological close (connect broken chars)
        """
        # 1. Grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        # 2. Resize if too small
        height, width = gray.shape
        if width < 1000:
            scale = 1000 / width
            gray = cv2.resize(gray, None, fx=scale, fy=scale,
                              interpolation=cv2.INTER_CUBIC)

        # 3. Deskew
        gray = self._deskew(gray)

        # 4. Denoise
        denoised = cv2.fastNlMeansDenoising(gray, h=10,
                                            templateWindowSize=7,
                                            searchWindowSize=21)

        # 5. Adaptive thresholding
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blockSize=15, C=8
        )

        # 6. Morphological close
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
        result = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
        return result

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct image rotation/skew using Hough lines."""
        edges = cv2.Canny(image, 50, 150, apertureSize=3)
        lines = cv2.HoughLinesP(edges, 1, np.pi/180,
                                threshold=100, minLineLength=100, maxLineGap=10)
        if lines is None:
            return image

        # Keep only near-horizontal segments (text baselines)
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
            if abs(angle) < 45:
                angles.append(angle)
        if not angles:
            return image

        # The median is robust to a few stray lines; skip tiny corrections
        median_angle = np.median(angles)
        if abs(median_angle) < 0.5:
            return image

        h, w = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
        return cv2.warpAffine(image, M, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
```
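The angle selection inside `_deskew` can be checked without OpenCV. The sketch below reproduces only that step (median of near-horizontal segment angles, ignoring corrections under 0.5 degrees) in plain Python. The function name `estimate_skew` and the sample segments are hypothetical; in the real pipeline the segments would come from `cv2.HoughLinesP`.

```python
import math
from statistics import median
from typing import Iterable, Optional, Tuple

Segment = Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels


def estimate_skew(segments: Iterable[Segment],
                  max_abs_angle: float = 45.0,
                  min_correction: float = 0.5) -> Optional[float]:
    """Return the median angle (degrees) of near-horizontal segments.

    Returns None when there is nothing to correct: no usable segments,
    or a median skew smaller than min_correction.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) < max_abs_angle:  # drop vertical edges and borders
            angles.append(angle)
    if not angles:
        return None
    skew = median(angles)
    return skew if abs(skew) >= min_correction else None


if __name__ == "__main__":
    detected = [
        (0, 0, 100, 2),     # text baseline, slightly tilted
        (0, 10, 100, 12),   # another baseline with the same tilt
        (0, 0, 100, 30),    # outlier, outvoted by the median
        (50, 0, 50, 100),   # vertical edge, filtered out
    ]
    print(estimate_skew(detected))
```

Using the median rather than the mean keeps a single stray line (a logo edge, a fold) from dominating the correction.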
### 2. OCR Engine (`ocr_engine.py`)
```python
"""OCR engine wrapper for PaddleOCR and Tesseract."""
from dataclasses import dataclass
from typing import List

import numpy as np

try:
    from paddleocr import PaddleOCR
    PADDLE_AVAILABLE = True
except ImportError:
    PADDLE_AVAILABLE = False

try:
    import pytesseract
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False


@dataclass
class OCRResult:
    """Raw OCR result."""
    text: str
    confidence: float
    boxes: List[dict]


class OCREngine:
    """Unified OCR engine with fallback support."""

    def __init__(self):
        self._paddle = None
        self._init_engines()

    def _init_engines(self):
        if PADDLE_AVAILABLE:
            self._paddle = PaddleOCR(
                use_angle_cls=True,
                lang='en',  # Better for mixed text
                use_gpu=False,
                show_log=False,
                det_db_thresh=0.3,
                det_db_box_thresh=0.5,
            )

    def recognize(self, image: np.ndarray) -> OCRResult:
        """Perform OCR on preprocessed image."""
        # Tesseract is used only when PaddleOCR is not installed
        if PADDLE_AVAILABLE and self._paddle:
            return self._paddle_recognize(image)
        elif TESSERACT_AVAILABLE:
            return self._tesseract_recognize(image)
        else:
            raise RuntimeError("No OCR engine available")

    def _paddle_recognize(self, image: np.ndarray) -> OCRResult:
        result = self._paddle.ocr(image, cls=True)
        if not result or not result[0]:
            return OCRResult(text="", confidence=0.0, boxes=[])

        lines = []
        total_conf = 0.0
        boxes = []
        for line in result[0]:
            box, (text, conf) = line
            lines.append(text)
            total_conf += conf
            boxes.append({'text': text, 'confidence': conf, 'box': box})
        avg_conf = total_conf / len(result[0]) if result[0] else 0.0
        return OCRResult(text='\n'.join(lines), confidence=avg_conf, boxes=boxes)

    def _tesseract_recognize(self, image: np.ndarray) -> OCRResult:
        config = '--psm 6 -l ron+eng'
        text = pytesseract.image_to_string(image, config=config)
        data = pytesseract.image_to_data(image, config=config,
                                         output_type=pytesseract.Output.DICT)
        # 'conf' values may be ints, floats, or numeric strings depending on
        # the Tesseract version, so parse them as floats; -1 marks non-text.
        confidences = [float(c) for c in data['conf'] if float(c) > 0]
        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
        return OCRResult(text=text, confidence=avg_conf, boxes=[])
```
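Note that `recognize()` above switches to Tesseract only when PaddleOCR is not installed; it does not retry when PaddleOCR raises or returns nothing. If a runtime fallback is wanted, one possible shape is sketched below. Everything here is an assumption layered on the plan: `recognize_with_fallback`, the `min_confidence` value, and the stub engines are hypothetical, and real engines would be wrapped as callables returning `(text, confidence)`.

```python
from typing import Callable, List, Sequence, Tuple

# A recognizer takes an image object and returns (text, confidence in 0..1).
Recognizer = Callable[[object], Tuple[str, float]]


def recognize_with_fallback(image: object,
                            engines: Sequence[Tuple[str, Recognizer]],
                            min_confidence: float = 0.5) -> Tuple[str, str, float]:
    """Try each engine in order; return (engine_name, text, confidence).

    An engine is skipped when it raises, returns blank text, or reports a
    confidence below min_confidence.
    """
    errors: List[str] = []
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception as exc:  # engine crashed or is misconfigured
            errors.append(f"{name}: {exc}")
            continue
        if text.strip() and confidence >= min_confidence:
            return name, text, confidence
        errors.append(f"{name}: empty or low-confidence result")
    raise RuntimeError("All OCR engines failed: " + "; ".join(errors))


# Stub engines standing in for PaddleOCR and Tesseract in this sketch.
def broken_primary(image: object) -> Tuple[str, float]:
    raise RuntimeError("model files missing")


def working_fallback(image: object) -> Tuple[str, float]:
    return "BON FISCAL\nTOTAL 125,50", 0.82


if __name__ == "__main__":
    chain = [("paddle", broken_primary), ("tesseract", working_fallback)]
    print(recognize_with_fallback(b"fake-image", chain))
```

Keeping the engines behind a common callable signature also makes the chain easy to unit test without loading any OCR model.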
### 3. Receipt Extractor (`ocr_extractor.py`)
```python
"""Extract structured fields from OCR text."""
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Structured extraction result."""
    receipt_type: str = 'bon_fiscal'
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None
    confidence_amount: float = 0.0
    confidence_date: float = 0.0
    confidence_vendor: float = 0.0
    raw_text: str = ""

    @property
    def overall_confidence(self) -> float:
        weights = {'amount': 0.4, 'date': 0.3, 'vendor': 0.3}
        return round(
            self.confidence_amount * weights['amount'] +
            self.confidence_date * weights['date'] +
            self.confidence_vendor * weights['vendor'], 2
        )


class ReceiptExtractor:
    """Extract receipt fields using pattern matching."""

    # Each entry is (regex, confidence); patterns are tried in order.
    # The amount group allows spaces but not newlines, so digits from the
    # following line (a date, a receipt number) are never merged into it.
    TOTAL_PATTERNS = [
        (r'TOTAL\s*:?\s*([\d .,]+)\s*(?:RON|LEI)?', 0.95),
        (r'TOTAL\s+(?:RON|LEI)\s*([\d .,]+)', 0.95),
        (r'DE\s+PLATA\s*:?\s*([\d .,]+)', 0.90),
        (r'SUMA\s*:?\s*([\d .,]+)', 0.85),
    ]
    DATE_PATTERNS = [
        (r'DATA\s*:?\s*(\d{2}[./]\d{2}[./]\d{4})', 0.95),
        (r'(\d{2}[./]\d{2}[./]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[./]\d{2}[./]\d{4})', 0.80),
    ]
    NUMBER_PATTERNS = [
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*:?\s*(\d{4,})', 0.70),
    ]
    CUI_PATTERNS = [
        (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
        (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
    ]

    def extract(self, text: str) -> ExtractionResult:
        result = ExtractionResult()
        text_upper = text.upper()
        result.amount, result.confidence_amount = self._extract_amount(text_upper)
        result.receipt_date, result.confidence_date = self._extract_date(text_upper)
        result.receipt_number, _ = self._extract_number(text_upper)
        result.partner_name, result.confidence_vendor = self._extract_vendor(text)
        result.cui, _ = self._extract_cui(text_upper)
        return result

    def _extract_amount(self, text: str) -> Tuple[Optional[Decimal], float]:
        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                try:
                    # Keep digits and separators, then treat the last
                    # separator as the decimal point (1.234,56 -> 1234.56)
                    amount_str = re.sub(r'[^\d.,]', '', match.group(1))
                    amount_str = amount_str.replace(',', '.')
                    parts = amount_str.split('.')
                    if len(parts) > 2:
                        amount_str = ''.join(parts[:-1]) + '.' + parts[-1]
                    amount = Decimal(amount_str)
                    if amount > 0:
                        return amount, confidence
                except (InvalidOperation, ValueError):
                    continue
        return None, 0.0

    def _extract_date(self, text: str) -> Tuple[Optional[date], float]:
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text)
            if match:
                try:
                    date_str = match.group(1).replace('/', '.')
                    parsed = datetime.strptime(date_str, '%d.%m.%Y').date()
                    # Reject future dates and implausibly old receipts
                    today = date.today()
                    if parsed <= today and parsed.year >= 2020:
                        return parsed, confidence
                except ValueError:
                    continue
        return None, 0.0

    def _extract_number(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.NUMBER_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                return match.group(1), confidence
        return None, 0.0

    def _extract_vendor(self, text: str) -> Tuple[Optional[str], float]:
        lines = text.split('\n')
        skip_keywords = ['BON', 'FISCAL', 'TOTAL', 'DATA', 'NR', 'ORA']
        # Match whole words only, so vendor names that merely contain a
        # keyword (e.g. "ORANGE" contains "ORA") are not skipped.
        skip_pattern = re.compile(r'\b(?:' + '|'.join(skip_keywords) + r')\b')
        for i, line in enumerate(lines[:5]):
            line = line.strip()
            if not line or re.match(r'^[\d.,\s]+$', line):
                continue
            if skip_pattern.search(line.upper()):
                continue
            vendor = re.sub(r'[^\w\s.,&-]', '', line).strip()
            if len(vendor) >= 3:
                # Confidence drops the further down the vendor line appears
                return vendor, 0.7 - (i * 0.1)
        return None, 0.0

    def _extract_cui(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.CUI_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                cui = match.group(1)
                if 6 <= len(cui) <= 10:
                    return cui, confidence
        return None, 0.0
```
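The amount normalization inside `_extract_amount` is the step most sensitive to Romanian number formatting, so it helps to see it in isolation. The sketch below lifts that logic into a standalone helper; the name `normalize_amount` is hypothetical, while the rules (drop everything except digits and separators, treat the last separator as the decimal point) follow the extractor above.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Turn an OCR amount string into a Decimal, or None if unusable."""
    # Keep only digits and separators, then unify the separator
    cleaned = re.sub(r"[^\d.,]", "", raw).replace(",", ".")
    parts = cleaned.split(".")
    if len(parts) > 2:
        # Several separators: all but the last are thousands separators
        cleaned = "".join(parts[:-1]) + "." + parts[-1]
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    return amount if amount > 0 else None


if __name__ == "__main__":
    for raw in ["125,50", "1.234,56", "1 234,56 LEI", "1.234", "LEI"]:
        print(repr(raw), "->", normalize_amount(raw))
```

One limitation worth testing against real receipts: a value such as `1.234` with a single separator is read as a decimal (1.234), not as one thousand two hundred thirty-four, because nothing in the string distinguishes the two readings.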
### 4. OCR Service (`ocr_service.py`)
```python
"""Main OCR service coordinating preprocessing, recognition, and extraction."""
from typing import Optional, Tuple
from pathlib import Path
import asyncio
from concurrent.futures import ThreadPoolExecutor

from app.services.ocr_engine import OCREngine
from app.services.ocr_extractor import ReceiptExtractor, ExtractionResult
from app.services.image_preprocessor import ImagePreprocessor


class OCRService:
    """Service for OCR processing of receipt images."""

    # Shared pool: OCR is CPU-bound, so it must not run on the event loop
    _executor = ThreadPoolExecutor(max_workers=2)

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.ocr_engine = OCREngine()
        self.extractor = ReceiptExtractor()

    async def process_image(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Process receipt image and extract structured data."""
        try:
            # get_running_loop() is the recommended call inside a coroutine
            result = await asyncio.get_running_loop().run_in_executor(
                self._executor,
                self._process_sync,
                image_path,
                mime_type
            )
            return result
        except Exception as e:
            return False, f"OCR processing failed: {str(e)}", None

    def _process_sync(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Synchronous processing (runs in thread pool)."""
        # Handle PDF
        if mime_type == 'application/pdf':
            images = self.preprocessor.pdf_to_images(image_path)
            if not images:
                return False, "Failed to extract images from PDF", None
            image = images[0]  # First page only
        else:
            image = self.preprocessor.load_image(image_path)

        # Preprocess
        processed = self.preprocessor.preprocess(image)

        # OCR
        ocr_result = self.ocr_engine.recognize(processed)
        if not ocr_result.text:
            return False, "No text detected in image", None

        # Extract fields
        extraction = self.extractor.extract(ocr_result.text)
        extraction.raw_text = ocr_result.text

        # Detect receipt type
        text_upper = ocr_result.text.upper()
        if 'CHITANTA' in text_upper or 'CHITANȚĂ' in text_upper:
            extraction.receipt_type = 'chitanta'
        else:
            extraction.receipt_type = 'bon_fiscal'
        return True, "OCR processing successful", extraction
```
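The service keeps the event loop responsive by pushing the blocking OCR call into a thread pool. The sketch below shows that pattern on its own with a stand-in for the OCR work. `slow_ocr` and the file names are hypothetical; `time.sleep` simulates the blocking call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Same pool size as OCRService in the plan
_executor = ThreadPoolExecutor(max_workers=2)


def slow_ocr(name: str) -> str:
    """Stand-in for the blocking preprocess + OCR + extract call."""
    time.sleep(0.2)
    return f"text from {name}"


async def process(name: str) -> str:
    """Run the blocking function in the pool without blocking the loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_executor, slow_ocr, name)


async def main() -> List[str]:
    # Both requests are handed to worker threads and awaited together
    return await asyncio.gather(process("a.jpg"), process("b.jpg"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With `max_workers=2`, a third concurrent request waits in the executor queue until a worker is free, which acts as a simple cap on CPU use.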
### 5. Schemas (`schemas/ocr.py`)
```python
"""Pydantic schemas for OCR API."""
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field


class ExtractionData(BaseModel):
    """Extracted receipt data."""
    receipt_type: str = Field(default='bon_fiscal')
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None
    confidence_amount: float = Field(default=0.0, ge=0, le=1)
    confidence_date: float = Field(default=0.0, ge=0, le=1)
    confidence_vendor: float = Field(default=0.0, ge=0, le=1)
    overall_confidence: float = Field(default=0.0, ge=0, le=1)
    raw_text: str = Field(default="")


class OCRResponse(BaseModel):
    """OCR API response."""
    success: bool
    message: str
    data: Optional[ExtractionData] = None
```
### 6. Router (`routers/ocr.py`)
```python
"""OCR API endpoints."""
from dataclasses import asdict
from pathlib import Path
import tempfile
import os

from fastapi import APIRouter, HTTPException, UploadFile, File, Depends
from sqlalchemy.ext.asyncio import AsyncSession

from app.db.database import get_session
from app.db.crud.attachment import AttachmentCRUD
from app.services.ocr_service import OCRService
from app.services.ocr_extractor import ExtractionResult
from app.schemas.ocr import ExtractionData, OCRResponse

router = APIRouter()
ocr_service = OCRService()

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB limit from the API design


def _to_schema(result: ExtractionResult) -> ExtractionData:
    """Convert the extractor dataclass into the response schema."""
    # asdict() skips properties, so overall_confidence is passed explicitly
    return ExtractionData(**asdict(result),
                          overall_confidence=result.overall_confidence)


@router.post("/extract", response_model=OCRResponse)
async def extract_from_image(file: UploadFile = File(...)):
    """Extract receipt data from uploaded image."""
    allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
    if file.content_type not in allowed_types:
        raise HTTPException(400, f"File type not supported: {file.content_type}")

    # Read and size-check before anything is written to disk
    content = await file.read()
    if len(content) > MAX_UPLOAD_BYTES:
        raise HTTPException(413, "File too large (max 10MB)")

    suffix = Path(file.filename).suffix if file.filename else '.jpg'
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)

    try:
        success, message, result = await ocr_service.process_image(
            tmp_path, file.content_type
        )
        if not success:
            raise HTTPException(422, message)
        return OCRResponse(success=True, message=message, data=_to_schema(result))
    finally:
        # Always remove the temp file, including on errors
        os.unlink(tmp_path)


@router.post("/extract-attachment/{attachment_id}", response_model=OCRResponse)
async def extract_from_attachment(
    attachment_id: int,
    session: AsyncSession = Depends(get_session),
):
    """Extract receipt data from existing attachment."""
    attachment = await AttachmentCRUD.get_by_id(session, attachment_id)
    if not attachment:
        raise HTTPException(404, "Attachment not found")

    file_path = AttachmentCRUD.get_file_path(attachment)
    if not file_path.exists():
        raise HTTPException(404, "File not found on disk")

    success, message, result = await ocr_service.process_image(
        file_path, attachment.mime_type
    )
    if not success:
        raise HTTPException(422, message)
    return OCRResponse(success=True, message=message, data=_to_schema(result))
```
---
## Dependencies
### Python (`requirements.txt` additions)
```
# OCR Dependencies
paddleocr>=2.7.0,<3.0  # ocr_engine.py uses the 2.x API (use_gpu, show_log, ocr(cls=True))
paddlepaddle>=2.5.0
opencv-python>=4.8.0
pytesseract>=0.3.10
pdf2image>=1.16.0
```
### System (Linux/Docker)
```bash
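# Note: on newer Debian/Ubuntu releases libgl1-mesa-glx may no longer be
# packaged; libgl1 is the usual replacement there.
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0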
```
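Before wiring up the endpoint, it can be useful to confirm that both the Python packages and the system binaries are actually present. The script below is a minimal sketch of such a check using only the standard library. The function name `check_ocr_dependencies` is hypothetical; `pdftoppm` is listed because it is one of the Poppler tools installed by `poppler-utils`.

```python
import importlib.util
import shutil
from typing import Dict

# Import names of the packages listed in requirements.txt
PYTHON_MODULES = ["paddleocr", "cv2", "pytesseract", "pdf2image"]
# Executables provided by the system packages above
SYSTEM_BINARIES = ["tesseract", "pdftoppm"]


def check_ocr_dependencies() -> Dict[str, bool]:
    """Report which OCR dependencies can be found, without importing them."""
    status: Dict[str, bool] = {}
    for module in PYTHON_MODULES:
        # find_spec locates the module but does not execute it
        status[f"python:{module}"] = importlib.util.find_spec(module) is not None
    for binary in SYSTEM_BINARIES:
        status[f"bin:{binary}"] = shutil.which(binary) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(f"{'OK     ' if found else 'MISSING'} {name}")
```

Using `find_spec` instead of a real import keeps the check fast, since importing PaddleOCR can load large model libraries.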
---
## User Flow
```
1. User opens "Bon Fiscal Nou" (new fiscal receipt)
2. User drags/selects the receipt photo in OCRUploadZone
3. [Spinner 2-3 sec] "Se procesează imaginea..." (processing the image)
4. OCRPreview appears with the extracted data + confidence indicators
5. User clicks "Aplică datele" (apply data) or corrects manually
6. The form is filled in automatically
7. User selects the expense type and cash register
8. User saves a draft or submits for approval
```
---
## Implementation Steps
### Step 1: Dependencies and setup
- [ ] Add dependencies to `requirements.txt`
- [ ] Install system packages (tesseract, poppler)
- [ ] Test the PaddleOCR import
### Step 2: Backend services
- [ ] Create `image_preprocessor.py`
- [ ] Create `ocr_engine.py`
- [ ] Create `ocr_extractor.py`
- [ ] Create `ocr_service.py`
- [ ] Create `schemas/ocr.py`
### Step 3: API endpoint
- [ ] Create `routers/ocr.py`
- [ ] Include the router in `main.py`
- [ ] Test the endpoint
### Step 4: Frontend components
- [ ] Create `OCRUploadZone.vue`
- [ ] Create `OCRPreview.vue`
- [ ] Create `OCRConfidenceIndicator.vue`
### Step 5: Integration
- [ ] Modify `ReceiptCreateView.vue`
- [ ] Add auto-fill from the OCR result
- [ ] Add visual feedback
### Step 6: Testing
- [ ] Test on sample Romanian receipts
- [ ] Tune the regex patterns
- [ ] Optimize preprocessing
---
## References to Existing Files
- `data-entry-app/backend/app/services/receipt_service.py` - Service pattern
- `data-entry-app/backend/app/db/crud/attachment.py` - File handling
- `data-entry-app/backend/app/schemas/receipt.py` - Schema patterns
- `data-entry-app/backend/app/db/models/receipt.py` - Receipt model
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - View to modify
- `data-entry-app/CLAUDE.md` - Instructions specific to data-entry


@@ -80,13 +80,14 @@ data/uploads/
│ │ Vue.js │ │ FastAPI │ │ (staging) │ │
│ │ :3010 │ │ :8003 │ │ │ │
│ └──────────────┘ └──────┬───────┘ └──────────────┘ │
│ │
│ Nomenclatoare │
▼ │
┌──────────────┐ │
│ Oracle │ │
│ (read-only) │ │
└──────────────┘ │
│ │
│ OCR Upload │ Nomenclatoare │
▼ │
┌──────────────┐ ┌──────────────┐ │
│ OCR Service │ │ Oracle │ │
│ PaddleOCR │ (read-only) │ │
│ +Tesseract └──────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
JWT_ALGORITHM=HS256
```
## OCR Processing Pipeline
### 5. OCR Architecture
**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing
**Rationale**:
- No external costs (no cloud APIs)
- On-premise processing (sensitive data stays local)
- PaddleOCR: high accuracy, CPU-friendly
- Tesseract: excellent support for Romanian diacritics
**OCR stack**:
```
┌─────────────────────────────────────────────────────┐
│                    OCR Pipeline                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Image Upload → ImagePreprocessor → OCREngine       │
│       │                 │               │           │
│       │                 ▼               ▼           │
│       │            ┌─────────┐   ┌──────────────┐   │
│       │            │ OpenCV  │   │  PaddleOCR   │   │
│       │            │ Pipeline│   │  (primary)   │   │
│       │            └─────────┘   └──────┬───────┘   │
│       │                 │               │           │
│       │                 │       fallback│           │
│       │                 │               ▼           │
│       │                 │        ┌──────────────┐   │
│       │                 │        │  Tesseract   │   │
│       │                 │        │  (ron+eng)   │   │
│       │                 │        └──────────────┘   │
│       │                 │               │           │
│       ▼                 ▼               ▼           │
│  ┌──────────────────────────────────────────────┐   │
│  │           ReceiptExtractor (Regex)           │   │
│  │  - Amount patterns (TOTAL, DE PLATA)         │   │
│  │  - Date patterns (DD.MM.YYYY)                │   │
│  │  - CUI patterns (C.U.I., C.I.F.)             │   │
│  │  - Vendor extraction (first lines)           │   │
│  └──────────────────────────────────────────────┘   │
│                         │                           │
│                         ▼                           │
│           ExtractionResult + Confidence             │
│                                                     │
└─────────────────────────────────────────────────────┘
```
### Image Preprocessing Pipeline
```python
def preprocess(image):
    """Outline of the preprocessing steps (see ImagePreprocessor.preprocess)."""
    processed_image = image
    # 1. Convert to grayscale
    # 2. Resize if width < 1000px (upscale for better OCR)
    # 3. Deskew using Hough lines (straighten rotated text)
    # 4. Denoise (Non-local means denoising)
    # 5. Adaptive thresholding (binarization)
    # 6. Morphological close (connect broken characters)
    return processed_image
```
### Extraction Patterns (Romanian Receipts)
| Pattern Type | Regex Examples | Confidence |
|--------------|----------------|------------|
| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
| Vendor | First 5 non-keyword lines | 0.70 |
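Each pattern type in the table is really an ordered list of `(regex, confidence)` pairs, and the first pattern that matches wins. The sketch below shows that lookup on a made-up receipt. The date and CUI patterns are copied from `ReceiptExtractor`; the helper `first_match` and the `SAMPLE` text are hypothetical.

```python
import re
from typing import List, Optional, Tuple

PatternList = List[Tuple[str, float]]

# Copied from ReceiptExtractor; most specific pattern first
DATE_PATTERNS: PatternList = [
    (r'DATA\s*:?\s*(\d{2}[./]\d{2}[./]\d{4})', 0.95),
    (r'(\d{2}[./]\d{2}[./]\d{4})\s+\d{2}:\d{2}', 0.90),
    (r'(\d{2}[./]\d{2}[./]\d{4})', 0.80),
]
CUI_PATTERNS: PatternList = [
    (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
    (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
]


def first_match(text: str, patterns: PatternList) -> Tuple[Optional[str], float]:
    """Return (captured value, confidence) for the first matching pattern."""
    for pattern, confidence in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1), confidence
    return None, 0.0


# Invented receipt text, used only to exercise the patterns
SAMPLE = (
    "MEGA IMAGE SRL\n"
    "C.I.F.: RO12345678\n"
    "BON FISCAL NR. 4521\n"
    "15.01.2024 14:32\n"
    "TOTAL LEI 125,50"
)

if __name__ == "__main__":
    print(first_match(SAMPLE, DATE_PATTERNS))
    print(first_match(SAMPLE, CUI_PATTERNS))
```

In the sample there is no `DATA:` label, so the date is picked up by the second pattern (date followed by a time) at confidence 0.90, and the CUI is found through the `C.I.F.` variant with the `RO` prefix stripped.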
### OCR API Endpoints
```
GET /api/ocr/status # Check OCR availability
POST /api/ocr/extract # Extract from uploaded image
POST /api/ocr/extract-attachment/{id} # Re-process existing attachment
```
### System Dependencies
```bash
# Ubuntu/Debian
apt-get install -y \
tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
poppler-utils libgl1-mesa-glx libglib2.0-0
```
## Testing Strategy
### Unit Tests
- CRUD operations
- Workflow transitions
- Entry generation logic
- OCR extraction patterns
### Integration Tests
- API endpoints
- File upload/download
- Oracle nomenclature fetch
- OCR endpoint with sample receipts
### E2E Tests
- Complete workflow: create → submit → approve
- File upload with preview
- OCR extraction → form auto-fill
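
For the "OCR extraction patterns" unit tests listed above, a test case might look like the sketch below. It re-implements the date rule from `ReceiptExtractor._extract_date` as a standalone function so the example runs on its own; `parse_receipt_date` and its `today` parameter are hypothetical, added so the "no future dates" rule can be tested deterministically.

```python
import re
import unittest
from datetime import date, datetime
from typing import Optional

DATE_RE = re.compile(r'(\d{2}[./]\d{2}[./]\d{4})')


def parse_receipt_date(text: str, today: Optional[date] = None) -> Optional[date]:
    """Parse the first DD.MM.YYYY or DD/MM/YYYY date, applying the plan's
    sanity rules: not in the future and not before 2020."""
    today = today or date.today()
    match = DATE_RE.search(text)
    if not match:
        return None
    try:
        parsed = datetime.strptime(match.group(1).replace('/', '.'), '%d.%m.%Y').date()
    except ValueError:  # e.g. 31.02.2024
        return None
    return parsed if parsed.year >= 2020 and parsed <= today else None


class ReceiptDateTests(unittest.TestCase):
    TODAY = date(2025, 1, 1)  # fixed reference date for repeatable tests

    def test_dot_separator(self):
        self.assertEqual(parse_receipt_date("DATA: 15.01.2024", self.TODAY),
                         date(2024, 1, 15))

    def test_slash_separator(self):
        self.assertEqual(parse_receipt_date("15/01/2024 14:32", self.TODAY),
                         date(2024, 1, 15))

    def test_future_date_rejected(self):
        self.assertIsNone(parse_receipt_date("15.01.2030", self.TODAY))

    def test_invalid_calendar_date_rejected(self):
        self.assertIsNone(parse_receipt_date("31.02.2024", self.TODAY))

    def test_too_old_rejected(self):
        self.assertIsNone(parse_receipt_date("15.01.2019", self.TODAY))


if __name__ == "__main__":
    unittest.main()
```

Injecting `today` avoids tests that silently change behavior as the calendar advances.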


@@ -3,6 +3,7 @@
## Objective
Fiscal receipt entry system with:
- **Automatic OCR** for extracting data from receipt photos (100% local, no costs)
- **Photo upload** of receipts by users
- **Automatic generation** of accounting entries (staging area)
- **Accountant approval** before finalization
@@ -13,8 +14,10 @@ Sistem de introducere bonuri fiscale cu:
### 1. Fiscal Receipt Management
#### 1.1 Create Receipt
- The user can upload a photo of the fiscal receipt
#### 1.1 Create Receipt with OCR
- The user uploads a photo of the fiscal receipt
- **OCR automatically extracts**: amount, date, vendor, CUI, receipt number
- The user checks and corrects the extracted data
- Required fields: document type, direction, date, amount, vendor, cash register/bank
- Optional fields: receipt number, series, description
- Document types: Bon Fiscal, Chitanta
@@ -145,11 +148,71 @@ GET /api/receipts/cash-registers # Case/Banci
GET /api/receipts/expense-types # Expense types
```
### OCR
```
GET /api/ocr/status # Check OCR availability
POST /api/ocr/extract # Extract data from an uploaded image
POST /api/ocr/extract-attachment/{id} # Re-process an existing attachment
```
## OCR - Technical Specifications
### OCR Requirements
- **100% local** - no external costs, no cloud APIs
- **Full-auto** - fills in the form automatically
- **Input**: images only (JPG/PNG/PDF)
- **On-premise** - sensitive data stays local
### Automatically Extracted Fields
| Field | Type | Estimated Accuracy |
|-------|------|--------------------|
| Amount (TOTAL) | Decimal | 90-95% |
| Receipt date | Date | 85-90% |
| Receipt number | String | 80-85% |
| Vendor | String | 70-80% |
| CUI | String | 85-90% |
| Document type | Enum | 95%+ |
### OCR Technical Stack
| Component | Solution | Rationale |
|-----------|----------|-----------|
| **OCR Engine** | PaddleOCR (primary) | 85-92% estimated accuracy, pip install, CPU-friendly |
| **Fallback OCR** | Tesseract + `ron` language pack | Excellent support for Romanian diacritics |
| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100ms) |
| **Preprocessing** | OpenCV | Deskew, binarization, denoise |
| **PDF → Image** | pdf2image + Poppler | Standard, reliable |
### System Dependencies (Linux)
```bash
apt-get install -y \
tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
poppler-utils libgl1-mesa-glx libglib2.0-0
```
### OCR User Flow
```
1. User opens "Bon Fiscal Nou" (new fiscal receipt)
2. User drags/selects the receipt photo
3. User clicks "Proceseaza cu OCR" (process with OCR)
4. [Spinner 2-3 sec] "Se proceseaza imaginea..." (processing the image)
5. A preview appears with the extracted data + confidence indicators
6. User clicks "Aplica datele" (apply data) or corrects manually
7. The form is filled in automatically
8. User selects the expense type and cash register
9. User saves a draft or submits for approval
```
## Success Criteria (Phase 1)
- [ ] User can upload a receipt photo + basic data
- [ ] System automatically generates accounting entries
- [ ] Accountant can view, edit, and approve entries
- [ ] Approved receipts are visible in the list
- [ ] Alembic migrations work correctly
- [ ] Receipt photos are saved and displayed correctly
- [x] User can upload a receipt photo + basic data
- [x] **OCR automatically extracts data from the receipt photo**
- [x] **Confidence indicators for extracted data**
- [x] System automatically generates accounting entries
- [x] Accountant can view, edit, and approve entries
- [x] Approved receipts are visible in the list
- [x] Alembic migrations work correctly
- [x] Receipt photos are saved and displayed correctly