# OCR Implementation Plan - Data Entry App

> **Context Handover Document**
> Created: 2025-12-11
> Branch: `feature/data-entry-receipts`
> Status: Ready for implementation

## Executive Summary

Implement fully local OCR (no external costs) to automatically extract data from Romanian fiscal receipts and payment receipts (bonuri fiscale / chitanțe). The solution uses PaddleOCR plus regex-based extraction and fills in the form automatically.

**User requirements:**
- Open-source and local, with no external costs
- Full-auto: fills in the form automatically
- Input: image files only (JPG/PNG/PDF)
- On-premise processing

---

## Recommended Tech Stack

| Component | Solution | Rationale |
|-----------|----------|-----------|
| **OCR Engine** | PaddleOCR (primary) | 85-92% accuracy, simple pip install, CPU-friendly |
| **Fallback OCR** | Tesseract + `ron` | Excellent support for Romanian diacritics |
| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100 ms), deterministic |
| **Preprocessing** | OpenCV | Deskew, binarization, denoising; essential for thermal receipts |
| **PDF → Image** | pdf2image + Poppler | Standard, reliable |

---

## Files to Create

### Backend (new)
```
data-entry-app/backend/app/
├── services/
│   ├── ocr_service.py          # OCR orchestration (async)
│   ├── ocr_engine.py           # PaddleOCR + Tesseract wrapper
│   ├── ocr_extractor.py        # Regex patterns for Romanian receipts
│   └── image_preprocessor.py   # OpenCV pipeline
├── schemas/
│   └── ocr.py                  # ExtractionData, OCRResponse
└── routers/
    └── ocr.py                  # POST /api/ocr/extract
```

### Frontend (new)
```
data-entry-app/frontend/src/components/ocr/
├── OCRUploadZone.vue              # Drag-and-drop + OCR trigger
├── OCRPreview.vue                 # Preview of extracted data
└── OCRConfidenceIndicator.vue     # Visual confidence indicator
```

### Changes to existing files
- `data-entry-app/backend/requirements.txt` - add OCR dependencies
- `data-entry-app/backend/app/main.py` - include the OCR router
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - OCR integration

---

## Fields to Extract (from the Receipt model)

Target fields for OCR extraction (see `data-entry-app/backend/app/db/models/receipt.py`):

| Field | Type | Estimated accuracy |
|-------|------|--------------------|
| `receipt_type` | Enum: BON_FISCAL, CHITANTA | 95%+ |
| `receipt_number` | String (max 50) | 80-85% |
| `receipt_date` | Date | 85-90% |
| `amount` | Decimal (2 decimal places) | 90-95% |
| `partner_name` | String (max 200) | 70-80% |
| `cui` | String (fiscal code) | 85-90% |
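
Because the `cui` field is a Romanian fiscal code that ends in a control digit, a checksum test could be used to catch OCR misreads before the value reaches the form. The sketch below is an optional addition, not part of the planned extractor: `is_valid_cui` is a hypothetical helper name, and the code assumes the commonly documented control-key scheme (weights 7 5 3 2 1 7 5 3 2 applied to the digits before the control digit).

```python
CUI_WEIGHTS = (7, 5, 3, 2, 1, 7, 5, 3, 2)  # assumed control key "753217532"


def is_valid_cui(cui: str) -> bool:
    """Return True if the last digit matches the computed control digit."""
    digits = cui.strip().upper()
    if digits.startswith("RO"):
        digits = digits[2:].strip()
    if not (digits.isascii() and digits.isdigit()) or not 2 <= len(digits) <= 10:
        return False
    body, control = digits[:-1], int(digits[-1])
    # Left-pad the body to 9 digits so it lines up with the weights.
    padded = body.zfill(9)
    total = sum(int(d) * w for d, w in zip(padded, CUI_WEIGHTS))
    expected = (total * 10) % 11
    if expected == 10:
        expected = 0
    return control == expected
```

Under this scheme the placeholder value `12345678` used in the sample API response does not pass the check, so a failed checksum is probably better treated as a reason to lower the field's confidence than as a reason to discard the value.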

---

## API Design

### `POST /api/ocr/extract`

**Input**: `multipart/form-data` with a file (JPG/PNG/PDF, max 10 MB)

**Output**:
```json
{
  "success": true,
  "message": "OCR processing successful",
  "data": {
    "receipt_type": "bon_fiscal",
    "receipt_number": "12345",
    "receipt_series": null,
    "receipt_date": "2024-01-15",
    "amount": 125.50,
    "partner_name": "MEGA IMAGE SRL",
    "cui": "12345678",
    "description": null,
    "confidence_amount": 0.95,
    "confidence_date": 0.90,
    "confidence_vendor": 0.75,
    "overall_confidence": 0.87,
    "raw_text": "BON FISCAL\nMEGA IMAGE SRL\n..."
  }
}
```

### `POST /api/ocr/extract-attachment/{attachment_id}`
Re-processes an existing attachment.

---

## Detailed Implementation

### 1. Image Preprocessor (`image_preprocessor.py`)

```python
"""Image preprocessing for optimal OCR results."""
from pathlib import Path
from typing import List
import numpy as np
import cv2

try:
    import pdf2image
    PDF_AVAILABLE = True
except ImportError:
    PDF_AVAILABLE = False


class ImagePreprocessor:
    """Preprocess receipt images for OCR."""

    def load_image(self, path: Path) -> np.ndarray:
        """Load image from file (BGR channel order)."""
        image = cv2.imread(str(path))
        if image is None:
            raise ValueError(f"Could not load image: {path}")
        return image

    def pdf_to_images(self, path: Path, dpi: int = 300) -> List[np.ndarray]:
        """Convert PDF pages to BGR images."""
        if not PDF_AVAILABLE:
            raise RuntimeError("pdf2image not available")
        pages = pdf2image.convert_from_path(str(path), dpi=dpi)
        # PIL pages are RGB; convert to BGR so they match cv2.imread output.
        return [
            cv2.cvtColor(np.array(page.convert("RGB")), cv2.COLOR_RGB2BGR)
            for page in pages
        ]

    def preprocess(self, image: np.ndarray) -> np.ndarray:
        """
        Apply preprocessing pipeline for thermal receipt images.

        Pipeline:
        1. Convert to grayscale
        2. Resize if too small (min 1000px width)
        3. Deskew (straighten rotated text)
        4. Denoise (Non-local means)
        5. Adaptive thresholding (binarization)
        6. Morphological open (connect broken chars)
        """
        # 1. Grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        # 2. Resize if too small
        height, width = gray.shape
        if width < 1000:
            scale = 1000 / width
            gray = cv2.resize(gray, None, fx=scale, fy=scale,
                              interpolation=cv2.INTER_CUBIC)

        # 3. Deskew
        gray = self._deskew(gray)

        # 4. Denoise
        denoised = cv2.fastNlMeansDenoising(gray, h=10,
                                            templateWindowSize=7,
                                            searchWindowSize=21)

        # 5. Adaptive thresholding (text becomes black on a white background)
        binary = cv2.adaptiveThreshold(
            denoised, 255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blockSize=15, C=8
        )

        # 6. Morphological open. OpenCV morphology acts on white pixels, so
        #    opening the white background closes small gaps in the dark
        #    strokes; MORPH_CLOSE would thin the text instead.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
        result = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

        return result

    def _deskew(self, image: np.ndarray) -> np.ndarray:
        """Correct image rotation/skew using Hough lines."""
        edges = cv2.Canny(image, 50, 150, apertureSize=3)
        lines = cv2.HoughLinesP(edges, 1, np.pi/180,
                                threshold=100, minLineLength=100, maxLineGap=10)

        if lines is None:
            return image

        # Keep only near-horizontal segments; steep ones are not text lines.
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            angle = np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi
            if abs(angle) < 45:
                angles.append(angle)

        if not angles:
            return image

        median_angle = np.median(angles)
        if abs(median_angle) < 0.5:
            return image

        h, w = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, median_angle, 1.0)
        return cv2.warpAffine(image, M, (w, h),
                              flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
```

### 2. OCR Engine (`ocr_engine.py`)

```python
"""OCR engine wrapper for PaddleOCR and Tesseract."""
from dataclasses import dataclass
from typing import List
import numpy as np

try:
    from paddleocr import PaddleOCR
    PADDLE_AVAILABLE = True
except ImportError:
    PADDLE_AVAILABLE = False

try:
    import pytesseract
    TESSERACT_AVAILABLE = True
except ImportError:
    TESSERACT_AVAILABLE = False


@dataclass
class OCRResult:
    """Raw OCR result."""
    text: str
    confidence: float
    boxes: List[dict]


class OCREngine:
    """Unified OCR engine: PaddleOCR when installed, otherwise Tesseract."""

    def __init__(self):
        self._paddle = None
        self._init_engines()

    def _init_engines(self):
        if PADDLE_AVAILABLE:
            # Constructor arguments follow the PaddleOCR 2.x API.
            self._paddle = PaddleOCR(
                use_angle_cls=True,
                lang='en',  # Better for mixed text
                use_gpu=False,
                show_log=False,
                det_db_thresh=0.3,
                det_db_box_thresh=0.5,
            )

    def recognize(self, image: np.ndarray) -> OCRResult:
        """Perform OCR on preprocessed image."""
        if PADDLE_AVAILABLE and self._paddle:
            return self._paddle_recognize(image)
        elif TESSERACT_AVAILABLE:
            return self._tesseract_recognize(image)
        else:
            raise RuntimeError("No OCR engine available")

    def _paddle_recognize(self, image: np.ndarray) -> OCRResult:
        result = self._paddle.ocr(image, cls=True)

        if not result or not result[0]:
            return OCRResult(text="", confidence=0.0, boxes=[])

        lines = []
        total_conf = 0.0
        boxes = []

        # Each detected line is [box, (text, confidence)].
        for line in result[0]:
            box, (text, conf) = line
            lines.append(text)
            total_conf += conf
            boxes.append({'text': text, 'confidence': conf, 'box': box})

        avg_conf = total_conf / len(result[0]) if result[0] else 0.0
        return OCRResult(text='\n'.join(lines), confidence=avg_conf, boxes=boxes)

    def _tesseract_recognize(self, image: np.ndarray) -> OCRResult:
        config = '--psm 6 -l ron+eng'
        text = pytesseract.image_to_string(image, config=config)
        data = pytesseract.image_to_data(image, config=config,
                                         output_type=pytesseract.Output.DICT)
        # conf is -1 for non-text rows and may arrive as a float or a float
        # string depending on the Tesseract version, so parse with float().
        confidences = [float(c) for c in data['conf'] if float(c) > 0]
        avg_conf = sum(confidences) / len(confidences) / 100 if confidences else 0.0
        return OCRResult(text=text, confidence=avg_conf, boxes=[])
```

### 3. Receipt Extractor (`ocr_extractor.py`)

```python
"""Extract structured fields from OCR text."""
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Structured extraction result."""
    receipt_type: str = 'bon_fiscal'
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None

    confidence_amount: float = 0.0
    confidence_date: float = 0.0
    confidence_vendor: float = 0.0
    raw_text: str = ""

    @property
    def overall_confidence(self) -> float:
        """Weighted average of the per-field confidences."""
        weights = {'amount': 0.4, 'date': 0.3, 'vendor': 0.3}
        return round(
            self.confidence_amount * weights['amount'] +
            self.confidence_date * weights['date'] +
            self.confidence_vendor * weights['vendor'], 2
        )


class ReceiptExtractor:
    """Extract receipt fields using pattern matching."""

    # Each entry is (regex, confidence assigned when that regex matches).
    # \b keeps TOTAL from matching inside SUBTOTAL; the number group excludes
    # newlines so digits from the following line are not merged in.
    TOTAL_PATTERNS = [
        (r'\bTOTAL\s*:?\s*(\d[\d .,]*)\s*(?:RON|LEI)?', 0.95),
        (r'\bTOTAL\s+(?:RON|LEI)\s*(\d[\d .,]*)', 0.95),
        (r'\bDE\s+PLAT[AĂ]\s*:?\s*(\d[\d .,]*)', 0.90),
        (r'\bSUMA\s*:?\s*(\d[\d .,]*)', 0.85),
    ]

    DATE_PATTERNS = [
        (r'DATA\s*:?\s*(\d{2}[./-]\d{2}[./-]\d{4})', 0.95),
        (r'(\d{2}[./-]\d{2}[./-]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[./-]\d{2}[./-]\d{4})', 0.80),
    ]

    NUMBER_PATTERNS = [
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*:?\s*(\d{4,})', 0.70),
    ]

    CUI_PATTERNS = [
        (r'C\.?U\.?I\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
        (r'C\.?I\.?F\.?\s*:?\s*(?:RO)?(\d{6,10})', 0.95),
    ]

    def extract(self, text: str) -> ExtractionResult:
        result = ExtractionResult()
        text_upper = text.upper()

        result.amount, result.confidence_amount = self._extract_amount(text_upper)
        result.receipt_date, result.confidence_date = self._extract_date(text_upper)
        result.receipt_number, _ = self._extract_number(text_upper)
        # The vendor keeps its original casing.
        result.partner_name, result.confidence_vendor = self._extract_vendor(text)
        result.cui, _ = self._extract_cui(text_upper)

        return result

    def _extract_amount(self, text: str) -> Tuple[Optional[Decimal], float]:
        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
            if match:
                try:
                    # Keep digits and separators; drop trailing punctuation.
                    amount_str = re.sub(r'[^\d.,]', '', match.group(1)).strip('.,')
                    amount_str = amount_str.replace(',', '.')
                    parts = amount_str.split('.')
                    if len(parts) > 2:
                        # Only the last separator is the decimal point.
                        amount_str = ''.join(parts[:-1]) + '.' + parts[-1]
                    amount = Decimal(amount_str)
                    if amount > 0:
                        return amount, confidence
                except (InvalidOperation, ValueError):
                    continue
        return None, 0.0

    def _extract_date(self, text: str) -> Tuple[Optional[date], float]:
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text)
            if match:
                try:
                    date_str = re.sub(r'[/-]', '.', match.group(1))
                    parsed = datetime.strptime(date_str, '%d.%m.%Y').date()
                    today = date.today()
                    # Plausibility check: not in the future, not before 2020.
                    if parsed <= today and parsed.year >= 2020:
                        return parsed, confidence
                except ValueError:
                    continue
        return None, 0.0

    def _extract_number(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.NUMBER_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                return match.group(1), confidence
        return None, 0.0

    def _extract_vendor(self, text: str) -> Tuple[Optional[str], float]:
        lines = text.split('\n')
        # Whole-word match, so names such as ORANGE are not skipped for
        # containing ORA.
        skip_keywords = re.compile(r'\b(?:BON|FISCAL|TOTAL|DATA|NR|ORA)\b')

        # The vendor name is expected among the first lines of the receipt.
        for i, line in enumerate(lines[:5]):
            line = line.strip()
            if not line or re.match(r'^[\d.,\s]+$', line):
                continue
            if skip_keywords.search(line.upper()):
                continue
            vendor = re.sub(r'[^\w\s.,&-]', '', line).strip()
            if len(vendor) >= 3:
                # Confidence drops the further down the name is found.
                return vendor, 0.7 - (i * 0.1)
        return None, 0.0

    def _extract_cui(self, text: str) -> Tuple[Optional[str], float]:
        for pattern, confidence in self.CUI_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                cui = match.group(1)
                if 6 <= len(cui) <= 10:
                    return cui, confidence
        return None, 0.0
```
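
The amount handling is the part of the extractor most sensitive to receipt formatting, so it helps to exercise it in isolation. The following is a minimal standalone sketch of that normalization idea: keep digits and separators, trim trailing punctuation, then treat the last separator as the decimal point. `normalize_amount` is a hypothetical name used only for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Parse a Romanian-style amount such as '1.234,50' into a Decimal."""
    # Drop currency labels and spaces, then any trailing '.' or ','.
    cleaned = re.sub(r"[^\d.,]", "", raw).strip(".,")
    cleaned = cleaned.replace(",", ".")
    parts = cleaned.split(".")
    if len(parts) > 2:
        # Every separator except the last one is a thousands separator.
        cleaned = "".join(parts[:-1]) + "." + parts[-1]
    try:
        amount = Decimal(cleaned)
    except InvalidOperation:
        return None
    return amount if amount > 0 else None
```

A value with a thousands separator but no decimals, such as `1.234`, remains ambiguous under this rule and is read as 1.234, which is worth covering in the sample-receipt tests.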

### 4. OCR Service (`ocr_service.py`)

```python
"""Main OCR service coordinating preprocessing, recognition, and extraction."""
from typing import Optional, Tuple
from pathlib import Path
import asyncio
import re
from concurrent.futures import ThreadPoolExecutor

from app.services.ocr_engine import OCREngine
from app.services.ocr_extractor import ReceiptExtractor, ExtractionResult
from app.services.image_preprocessor import ImagePreprocessor


class OCRService:
    """Service for OCR processing of receipt images."""

    # Shared pool: at most two OCR jobs run at once, the rest queue up.
    _executor = ThreadPoolExecutor(max_workers=2)

    def __init__(self):
        self.preprocessor = ImagePreprocessor()
        self.ocr_engine = OCREngine()
        self.extractor = ReceiptExtractor()

    async def process_image(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Process receipt image and extract structured data."""
        try:
            # OCR is CPU-bound, so run it off the event loop.
            loop = asyncio.get_running_loop()
            result = await loop.run_in_executor(
                self._executor,
                self._process_sync,
                image_path,
                mime_type
            )
            return result
        except Exception as e:
            return False, f"OCR processing failed: {str(e)}", None

    def _process_sync(
        self,
        image_path: Path,
        mime_type: str
    ) -> Tuple[bool, str, Optional[ExtractionResult]]:
        """Synchronous processing (runs in thread pool)."""

        # Handle PDF
        if mime_type == 'application/pdf':
            images = self.preprocessor.pdf_to_images(image_path)
            if not images:
                return False, "Failed to extract images from PDF", None
            image = images[0]  # First page only
        else:
            image = self.preprocessor.load_image(image_path)

        # Preprocess
        processed = self.preprocessor.preprocess(image)

        # OCR
        ocr_result = self.ocr_engine.recognize(processed)
        if not ocr_result.text:
            return False, "No text detected in image", None

        # Extract fields
        extraction = self.extractor.extract(ocr_result.text)
        extraction.raw_text = ocr_result.text

        # Detect receipt type; accept CHITANTA with or without diacritics
        # (comma-below and cedilla forms of T included).
        text_upper = ocr_result.text.upper()
        if re.search(r'CHITAN[TȚŢ][AĂ]', text_upper):
            extraction.receipt_type = 'chitanta'
        else:
            extraction.receipt_type = 'bon_fiscal'

        return True, "OCR processing successful", extraction
```
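
The service keeps the event loop responsive by pushing the blocking pipeline onto a thread pool. A minimal, self-contained sketch of that pattern follows; `slow_ocr` is a hypothetical stand-in for `_process_sync` and simply sleeps instead of running OCR.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)


def slow_ocr(name: str) -> str:
    """Stand-in for blocking OCR work (hypothetical)."""
    time.sleep(0.1)
    return f"text from {name}"


async def process(name: str) -> str:
    loop = asyncio.get_running_loop()
    # The blocking call runs in a worker thread; the event loop stays free.
    return await loop.run_in_executor(executor, slow_ocr, name)


async def main() -> list:
    # Both calls are submitted together; gather returns results in call order.
    return await asyncio.gather(process("a.jpg"), process("b.jpg"))


results = asyncio.run(main())
```

With `max_workers=2`, a third simultaneous request waits in the executor queue until a worker is free, which bounds CPU use at the cost of added latency under load.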

### 5. Schemas (`schemas/ocr.py`)

```python
"""Pydantic schemas for OCR API."""
from datetime import date
from decimal import Decimal
from typing import Optional
from pydantic import BaseModel, Field


class ExtractionData(BaseModel):
    """Extracted receipt data."""
    receipt_type: str = Field(default='bon_fiscal')
    receipt_number: Optional[str] = None
    receipt_series: Optional[str] = None
    receipt_date: Optional[date] = None
    amount: Optional[Decimal] = None
    partner_name: Optional[str] = None
    cui: Optional[str] = None
    description: Optional[str] = None

    confidence_amount: float = Field(default=0.0, ge=0, le=1)
    confidence_date: float = Field(default=0.0, ge=0, le=1)
    confidence_vendor: float = Field(default=0.0, ge=0, le=1)
    overall_confidence: float = Field(default=0.0, ge=0, le=1)
    raw_text: str = Field(default="")


class OCRResponse(BaseModel):
    """OCR API response."""
    success: bool
    message: str
    data: Optional[ExtractionData] = None
```
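
`ExtractionResult` exposes `overall_confidence` as a property, while `ExtractionData` declares it as a regular field with a default of 0.0, so a plain `asdict()` conversion between the two would silently drop the computed value. The sketch below shows the issue using only the standard library; `Result` is a trimmed, hypothetical stand-in for `ExtractionResult`.

```python
from dataclasses import asdict, dataclass


@dataclass
class Result:
    """Trimmed stand-in for ExtractionResult (hypothetical)."""
    confidence_amount: float = 0.0
    confidence_date: float = 0.0
    confidence_vendor: float = 0.0

    @property
    def overall_confidence(self) -> float:
        return round(
            self.confidence_amount * 0.4
            + self.confidence_date * 0.3
            + self.confidence_vendor * 0.3,
            2,
        )


result = Result(confidence_amount=1.0, confidence_date=1.0, confidence_vendor=0.0)

payload = asdict(result)  # dataclass fields only; properties are not included
payload["overall_confidence"] = result.overall_confidence  # copy it explicitly
```

Copying the property explicitly at the point where the response is built is one way to keep the schema and the dataclass in agreement.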

### 6. Router (`routers/ocr.py`)

```python
"""OCR API endpoints."""
from dataclasses import asdict
from pathlib import Path
import tempfile
import os

from fastapi import APIRouter, HTTPException, UploadFile, File, Depends
from sqlalchemy.ext.asyncio import AsyncSession

from app.db.database import get_session
from app.db.crud.attachment import AttachmentCRUD
from app.services.ocr_service import OCRService
from app.services.ocr_extractor import ExtractionResult
from app.schemas.ocr import ExtractionData, OCRResponse

router = APIRouter()
ocr_service = OCRService()

ALLOWED_TYPES = ['image/jpeg', 'image/png', 'application/pdf']
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB limit from the API design


def _to_response(message: str, result: ExtractionResult) -> OCRResponse:
    """Convert the extractor dataclass into the response schema."""
    # asdict() skips properties, so overall_confidence is passed explicitly.
    data = ExtractionData(**asdict(result),
                          overall_confidence=result.overall_confidence)
    return OCRResponse(success=True, message=message, data=data)


@router.post("/extract", response_model=OCRResponse)
async def extract_from_image(file: UploadFile = File(...)):
    """Extract receipt data from uploaded image."""
    if file.content_type not in ALLOWED_TYPES:
        raise HTTPException(400, f"File type not supported: {file.content_type}")

    content = await file.read()
    if len(content) > MAX_UPLOAD_BYTES:
        raise HTTPException(413, "File exceeds the 10 MB limit")

    # Write the upload to a temporary file so OpenCV/pdf2image can read it.
    suffix = Path(file.filename).suffix if file.filename else '.jpg'
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(content)
        tmp_path = Path(tmp.name)

    try:
        success, message, result = await ocr_service.process_image(
            tmp_path, file.content_type
        )
        if not success:
            raise HTTPException(422, message)

        return _to_response(message, result)
    finally:
        os.unlink(tmp_path)


@router.post("/extract-attachment/{attachment_id}", response_model=OCRResponse)
async def extract_from_attachment(
    attachment_id: int,
    session: AsyncSession = Depends(get_session),
):
    """Extract receipt data from existing attachment."""
    attachment = await AttachmentCRUD.get_by_id(session, attachment_id)
    if not attachment:
        raise HTTPException(404, "Attachment not found")

    file_path = AttachmentCRUD.get_file_path(attachment)
    if not file_path.exists():
        raise HTTPException(404, "File not found on disk")

    success, message, result = await ocr_service.process_image(
        file_path, attachment.mime_type
    )
    if not success:
        raise HTTPException(422, message)

    return _to_response(message, result)
```

---

## Dependencies

### Python (`requirements.txt` - add)
```
# OCR Dependencies
# ocr_engine.py is written against the PaddleOCR 2.x API (use_gpu, show_log,
# ocr(..., cls=True)); the <3.0 caps assume 3.x no longer accepts these.
paddleocr>=2.7.0,<3.0
paddlepaddle>=2.5.0,<3.0
opencv-python>=4.8.0
pytesseract>=0.3.10
pdf2image>=1.16.0
```

### System (Linux/Docker)
```bash
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0
```
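
Before wiring the services together, it can help to confirm that the system binaries and Python packages are actually visible to the interpreter. The sketch below uses only the standard library; `check_ocr_dependencies` is a hypothetical helper, and it assumes that `pdftoppm` is the Poppler binary pdf2image relies on.

```python
import importlib.util
import shutil
from typing import Dict

BINARIES = ("tesseract", "pdftoppm")
MODULES = ("paddleocr", "cv2", "pytesseract", "pdf2image")


def check_ocr_dependencies() -> Dict[str, bool]:
    """Map each required binary/module name to whether it was found."""
    # Binaries are looked up on PATH; modules are located without importing.
    status = {name: shutil.which(name) is not None for name in BINARIES}
    for module in MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(f"{name:12} {'ok' if found else 'MISSING'}")
```

Running this inside the Docker image right after the install step gives a quick signal about which of the two OCR engines will be available at runtime.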

---

## User Flow

```
1. User opens "Bon Fiscal Nou" (New Fiscal Receipt)
2. User drags/selects the receipt photo into OCRUploadZone
3. [Spinner 2-3 sec] "Se procesează imaginea..." (Processing image...)
4. OCRPreview appears with extracted data + confidence indicators
5. User clicks "Aplică datele" (Apply data) or corrects manually
6. The form is filled in automatically
7. User selects expense type and cash register
8. User saves a draft or submits for approval
```

---

## Implementation Steps

### Step 1: Dependencies and setup
- [ ] Add dependencies to `requirements.txt`
- [ ] Install system packages (tesseract, poppler)
- [ ] Test the PaddleOCR import

### Step 2: Backend services
- [ ] Create `image_preprocessor.py`
- [ ] Create `ocr_engine.py`
- [ ] Create `ocr_extractor.py`
- [ ] Create `ocr_service.py`
- [ ] Create `schemas/ocr.py`

### Step 3: API endpoint
- [ ] Create `routers/ocr.py`
- [ ] Include the router in `main.py`
- [ ] Test the endpoint

### Step 4: Frontend components
- [ ] Create `OCRUploadZone.vue`
- [ ] Create `OCRPreview.vue`
- [ ] Create `OCRConfidenceIndicator.vue`

### Step 5: Integration
- [ ] Modify `ReceiptCreateView.vue`
- [ ] Add auto-fill from the OCR result
- [ ] Add visual feedback

### Step 6: Testing
- [ ] Test on sample Romanian receipts
- [ ] Tune the regex patterns
- [ ] Optimize preprocessing
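
For the testing step, the per-field accuracy estimates can be checked against a small hand-labeled set of receipts. The sketch below assumes each sample pairs the extractor output with the expected values as plain dictionaries; `field_accuracy` and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

FIELDS = ("receipt_number", "receipt_date", "amount", "partner_name", "cui")


def field_accuracy(samples: List[Tuple[dict, dict]]) -> Dict[str, float]:
    """Return the share of samples where each field equals the expected value."""
    if not samples:
        return {field: 0.0 for field in FIELDS}
    hits = {field: 0 for field in FIELDS}
    for extracted, expected in samples:
        for field in FIELDS:
            # A field missing on both sides compares as None == None (a hit).
            if extracted.get(field) == expected.get(field):
                hits[field] += 1
    return {field: hits[field] / len(samples) for field in FIELDS}


# Made-up samples: (extractor output, hand-labeled truth)
samples = [
    ({"amount": "125.50", "cui": "12345678"}, {"amount": "125.50", "cui": "12345678"}),
    ({"amount": "12.00", "cui": None}, {"amount": "12.00", "cui": "87654321"}),
]
report = field_accuracy(samples)
```

Re-running the same report after each regex or preprocessing change shows whether a tweak helped one field at the expense of another.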

---

## Existing File References

- `data-entry-app/backend/app/services/receipt_service.py` - Service pattern
- `data-entry-app/backend/app/db/crud/attachment.py` - File handling
- `data-entry-app/backend/app/schemas/receipt.py` - Schema patterns
- `data-entry-app/backend/app/db/models/receipt.py` - Receipt model
- `data-entry-app/frontend/src/views/receipts/ReceiptCreateView.vue` - View to modify
- `data-entry-app/CLAUDE.md` - Data-entry-specific instructions