feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name,
date, total amount, and VAT from uploaded receipt images/PDFs,
reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions

@@ -80,13 +80,14 @@ data/uploads/
│ │ Vue.js │ │ FastAPI │ │ (staging) │ │
│ │ :3010 │ │ :8003 │ │ │ │
│ └──────────────┘ └──────┬───────┘ └──────────────┘ │
│ │
│ Nomenclatures │
▼ │
┌──────────────┐ │
│ Oracle │ │
│ (read-only) │ │
└──────────────┘ │
│ │
│ OCR Upload │ Nomenclatures │
▼ │
┌──────────────┐ ┌──────────────┐ │
│ OCR Service │ │ Oracle │ │
│ PaddleOCR │ (read-only) │ │
│ +Tesseract └──────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
JWT_ALGORITHM=HS256
```
## OCR Processing Pipeline
### 5. OCR Architecture
**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing
**Rationale**:
- Zero external costs (no cloud APIs)
- On-premise processing (sensitive data stays local)
- PaddleOCR: high accuracy, CPU-friendly
- Tesseract: excellent support for Romanian diacritics
**OCR stack**:
```
┌─────────────────────────────────────────────────────┐
│ OCR Pipeline │
├─────────────────────────────────────────────────────┤
│ │
│ Image Upload → ImagePreprocessor → OCREngine │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────┐ ┌──────────────┐ │
│ │ │ OpenCV │ │ PaddleOCR │ │
│ │ │ Pipeline│ │ (primary) │ │
│ │ └─────────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ │ fallback│ │
│ │ │ ▼ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Tesseract │ │
│ │ │ │ (ron+eng) │ │
│ │ │ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ ReceiptExtractor (Regex) │ │
│ │ - Amount patterns (TOTAL, DE PLATA) │ │
│ │ - Date patterns (DD.MM.YYYY) │ │
│ │ - CUI patterns (C.U.I., C.I.F.) │ │
│ │ - Vendor extraction (first lines) │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ExtractionResult + Confidence │
│ │
└─────────────────────────────────────────────────────┘
```
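
The diagram does not say when the fallback fires. A minimal sketch, assuming Tesseract is tried whenever PaddleOCR raises an error or returns empty or low-confidence text. `OCRResult`, `run_with_fallback`, and the `min_confidence` threshold of 0.6 are hypothetical names and values, not taken from the commit. The engines are passed in as plain callables so the routing can be exercised without either library installed:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OCRResult:
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str


# Hypothetical engine signature: raw image bytes in, OCRResult out.
Engine = Callable[[bytes], OCRResult]


def run_with_fallback(image: bytes, primary: Engine, fallback: Engine,
                      min_confidence: float = 0.6) -> OCRResult:
    """Try the primary engine; use the fallback if it fails or looks unreliable."""
    try:
        result = primary(image)
    except Exception:
        # Assumption: any primary-engine error triggers the fallback.
        return fallback(image)
    if result.text.strip() and result.confidence >= min_confidence:
        return result
    alternative = fallback(image)
    # Keep whichever engine reported the higher confidence.
    return alternative if alternative.confidence > result.confidence else result
```

Recording the winning engine in the result makes it possible to see later how often the fallback path is actually taken.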
### Image Preprocessing Pipeline
```python
def preprocess(image):
    # Outline only; each step feeds its output to the next.
    # 1. Convert to grayscale
    # 2. Resize if width < 1000px (upscale for better OCR)
    # 3. Deskew using Hough lines (straighten rotated text)
    # 4. Denoise (non-local means denoising)
    # 5. Adaptive thresholding (binarization)
    # 6. Morphological close (connect broken characters)
    processed_image = ...  # placeholder for the output of step 6
    return processed_image
```
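
Steps 2 and 3 of the outline reduce to small numeric decisions that can be illustrated without OpenCV. The sketch below is an assumption about how they might be computed, not the commit's code. `upscale_factor` and `skew_angle` are hypothetical helpers, and the 45-degree cut-off is an illustrative value. In a real pipeline the line segments would come from a Hough transform such as `cv2.HoughLinesP`, and the image would then be rotated to cancel the estimated angle.

```python
import math
from statistics import median


def upscale_factor(width: int, min_width: int = 1000) -> float:
    """Step 2: factor that brings a narrow image up to min_width (1.0 = no resize)."""
    return 1.0 if width >= min_width else min_width / width


def skew_angle(segments, max_abs_deg: float = 45.0) -> float:
    """Step 3: median angle in degrees of roughly horizontal (x1, y1, x2, y2) segments."""
    angles = []
    for x1, y1, x2, y2 in segments:
        deg = math.degrees(math.atan2(y2 - y1, x2 - x1))
        # Fold into (-90, 90] so a segment's direction does not matter.
        if deg > 90:
            deg -= 180
        elif deg <= -90:
            deg += 180
        # Assumption: steeper segments are borders or noise, not text lines.
        if abs(deg) <= max_abs_deg:
            angles.append(deg)
    return median(angles) if angles else 0.0
```

Using the median rather than the mean keeps a few stray segments, such as a barcode edge, from dominating the estimate.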
### Extraction Patterns (Romanian Receipts)
| Pattern Type | Regex Examples | Confidence |
|--------------|----------------|------------|
| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
| Vendor | First 5 non-keyword lines | 0.70 |
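
To make the table concrete, here is a minimal sketch that applies the four regex rows to raw OCR text using only the standard library. The patterns and confidence values are copied from the table. `extract_fields`, `parse_amount`, and `parse_date` are hypothetical names, and the decimal-separator rule in `parse_amount` is an assumption about how amounts are printed. Vendor extraction is left out because the table only describes it as a heuristic.

```python
import re
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

# Patterns and confidence values copied from the table above.
PATTERNS = {
    "amount": (re.compile(r"TOTAL\s*:?\s*([\d.,]+)"), 0.95),
    "date": (re.compile(r"(\d{2}[./]\d{2}[./]\d{4})"), 0.90),
    "cui": (re.compile(r"C\.?U\.?I\.?\s*:?\s*(\d{6,10})"), 0.95),
    "receipt_number": (re.compile(r"NR\.?\s*BON\s*:?\s*(\d+)"), 0.95),
}


def parse_amount(raw: str) -> Optional[Decimal]:
    """Assumption: a final separator followed by exactly two digits is the decimal mark."""
    raw = raw.strip(".,")
    if not raw:
        return None
    match = re.fullmatch(r"([\d.,]*?)[.,](\d{2})", raw)
    if match:
        whole = re.sub(r"[.,]", "", match.group(1)) or "0"
        return Decimal(f"{whole}.{match.group(2)}")
    # No two-digit fraction: treat every separator as a thousands separator.
    return Decimal(re.sub(r"[.,]", "", raw))


def parse_date(raw: str) -> Optional[date]:
    """Parse DD.MM.YYYY or DD/MM/YYYY; return None for impossible dates."""
    try:
        return datetime.strptime(raw.replace("/", "."), "%d.%m.%Y").date()
    except ValueError:
        return None


def extract_fields(text: str) -> dict:
    """Return {field: (value, confidence)} for every pattern that matches."""
    converters = {"amount": parse_amount, "date": parse_date}
    fields = {}
    for name, (pattern, confidence) in PATTERNS.items():
        match = pattern.search(text)
        if not match:
            continue
        value = converters.get(name, str)(match.group(1))
        if value is not None:
            fields[name] = (value, confidence)
    return fields
```

Because the amount pattern is unanchored, it also matches the tail of `SUBTOTAL`. A word boundary before `TOTAL` would be one way to tighten it if subtotals appear above the total on real receipts.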
### OCR API Endpoints
```
GET /api/ocr/status # Check OCR availability
POST /api/ocr/extract # Extract from uploaded image
POST /api/ocr/extract-attachment/{id} # Re-process existing attachment
```
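
A client call to the extract endpoint could be sketched as follows, using only the standard library. The endpoint path comes from the list above and port 8003 from the architecture diagram. The multipart field name `file`, the default content type, and the helper name `build_extract_request` are assumptions, since the commit page does not show the request schema.

```python
import uuid
from urllib.request import Request


def build_extract_request(base_url: str, filename: str, content: bytes,
                          content_type: str = "image/jpeg",
                          field_name: str = "file") -> Request:
    """Build (but do not send) a multipart POST for /api/ocr/extract."""
    boundary = uuid.uuid4().hex
    disposition = (
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'
    )
    # One file part, then the closing boundary; lines are CRLF-separated.
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        disposition.encode(),
        f"Content-Type: {content_type}".encode(),
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ])
    return Request(
        f"{base_url.rstrip('/')}/api/ocr/extract",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The request can be sent with `urllib.request.urlopen(build_extract_request("http://localhost:8003", "receipt.jpg", data))`. The JWT settings elsewhere in this document suggest an `Authorization` header is probably required as well; it is omitted here because its format is not shown.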
### System Dependencies
```bash
# Ubuntu/Debian
apt-get install -y \
tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
poppler-utils libgl1-mesa-glx libglib2.0-0
```
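
Whether these dependencies are actually present can be probed at runtime, which is one plausible basis for the `GET /api/ocr/status` endpoint. The sketch below is an assumption, not the commit's implementation. `ocr_status` is a hypothetical name, `pdftoppm` is the PDF rasterizer shipped in `poppler-utils`, and `paddleocr` is assumed to be the importable module name of the PaddleOCR Python package.

```python
import importlib.util
import shutil


def ocr_status() -> dict:
    """Report which OCR dependencies are visible to the current process."""
    return {
        # Binaries installed by the apt packages listed above.
        "tesseract": shutil.which("tesseract") is not None,
        "pdftoppm": shutil.which("pdftoppm") is not None,
        # Python package check that avoids importing the heavy module.
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    }
```

This only confirms that the tools are installed. It does not verify that the Romanian language data (`tesseract-ocr-ron`) is available, which would need a separate check.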
## Testing Strategy
### Unit Tests
- CRUD operations
- Workflow transitions
- Entry generation logic
- OCR extraction patterns
### Integration Tests
- API endpoints
- File upload/download
- Oracle nomenclature fetch
- OCR endpoint with sample receipts
### E2E Tests
- Complete workflow: create → submit → approve
- File upload with preview
- OCR extraction → form auto-fill