feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name, date, total amount, and VAT from uploaded receipt images/PDFs, reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions
--- a/docs/data-entry/ARCHITECTURE.md
+++ b/docs/data-entry/ARCHITECTURE.md
@@ -80,13 +80,14 @@ data/uploads/
 │  │   Vue.js     │     │   FastAPI    │     │   (staging)  │   │
 │  │   :3010      │     │   :8003      │     │              │   │
 │  └──────────────┘     └──────┬───────┘     └──────────────┘   │
-│                              │                                  │
-│                              │ Nomenclatures                    │
-│                              ▼                                  │
-│                       ┌──────────────┐                         │
-│                       │   Oracle     │                         │
-│                       │ (read-only)  │                         │
-│                       └──────────────┘                         │
+│        │                     │                                  │
+│        │ OCR Upload          │ Nomenclatures                    │
+│        ▼                     ▼                                  │
+│  ┌──────────────┐     ┌──────────────┐                         │
+│  │  OCR Service │     │   Oracle     │                         │
+│  │  PaddleOCR   │     │ (read-only)  │                         │
+│  │  +Tesseract  │     └──────────────┘                         │
+│  └──────────────┘                                               │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
 JWT_ALGORITHM=HS256
 ```

+## OCR Processing Pipeline
+
+### 5. OCR Architecture
+
+**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing
+
+**Rationale**:
+- No external costs (no cloud APIs)
+- On-premise processing (sensitive data stays local)
+- PaddleOCR: high accuracy, CPU-friendly
+- Tesseract: excellent support for Romanian diacritics
+
+**OCR stack**:
+```
+┌─────────────────────────────────────────────────────┐
+│                   OCR Pipeline                       │
+├─────────────────────────────────────────────────────┤
+│                                                      │
+│  Image Upload → ImagePreprocessor → OCREngine        │
+│       │              │                  │            │
+│       │              ▼                  ▼            │
+│       │         ┌─────────┐      ┌──────────────┐   │
+│       │         │ OpenCV  │      │ PaddleOCR    │   │
+│       │         │ Pipeline│      │ (primary)    │   │
+│       │         └─────────┘      └──────┬───────┘   │
+│       │              │                  │            │
+│       │              │           fallback│            │
+│       │              │                  ▼            │
+│       │              │           ┌──────────────┐   │
+│       │              │           │ Tesseract    │   │
+│       │              │           │ (ron+eng)    │   │
+│       │              │           └──────────────┘   │
+│       │              │                  │            │
+│       ▼              ▼                  ▼            │
+│  ┌──────────────────────────────────────────────┐   │
+│  │           ReceiptExtractor (Regex)           │   │
+│  │  - Amount patterns (TOTAL, DE PLATA)         │   │
+│  │  - Date patterns (DD.MM.YYYY)                │   │
+│  │  - CUI patterns (C.U.I., C.I.F.)             │   │
+│  │  - Vendor extraction (first lines)           │   │
+│  └──────────────────────────────────────────────┘   │
+│                        │                             │
+│                        ▼                             │
+│              ExtractionResult + Confidence           │
+│                                                      │
+└─────────────────────────────────────────────────────┘
+```
+
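The primary/fallback arrangement in the diagram can be sketched as below. This is a minimal illustration, not the code from this commit: `OcrResult`, `run_with_fallback`, and the `min_confidence` threshold are hypothetical names, and the rule for when the fallback kicks in (primary engine raises or scores below the threshold) is an assumption. The engines are passed in as plain callables, so no OCR library is required to run it.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical engine interface: image bytes in, (text, confidence) out.
Engine = Callable[[bytes], Tuple[str, float]]


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str        # "primary" or "fallback"


def run_with_fallback(image: bytes, primary: Engine, fallback: Engine,
                      min_confidence: float = 0.6) -> OcrResult:
    """Try the primary engine; use the fallback if it fails or scores too low."""
    best: Optional[OcrResult] = None
    for name, engine in (("primary", primary), ("fallback", fallback)):
        try:
            text, confidence = engine(image)
        except Exception:
            continue  # engine unavailable or crashed: try the next one
        result = OcrResult(text, confidence, name)
        if confidence >= min_confidence and text.strip():
            return result  # good enough, stop here
        if best is None or confidence > best.confidence:
            best = result  # remember the least-bad result so far
    if best is None:
        raise RuntimeError("no OCR engine produced a result")
    return best
```

In a real deployment the two callables would wrap PaddleOCR and Tesseract (`ron+eng`); keeping them injectable also makes the fallback path easy to unit-test with stubs.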
+### Image Preprocessing Pipeline
+
+```python
+def preprocess(image):
+    # Pseudocode outline: each step consumes the previous step's output.
+    # 1. Convert to grayscale
+    # 2. Resize if width < 1000px (upscale for better OCR)
+    # 3. Deskew using Hough lines (straighten rotated text)
+    # 4. Denoise (non-local means denoising)
+    # 5. Adaptive thresholding (binarization)
+    # 6. Morphological close (connect broken characters)
+    return processed_image
+```
+
+### Extraction Patterns (Romanian Receipts)
+
+| Pattern Type | Regex Examples | Confidence |
+|--------------|----------------|------------|
+| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
+| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
+| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
+| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
+| Vendor | First 5 non-keyword lines | 0.70 |
+
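To make the table concrete, here is a small sketch that applies the Amount, Date, and CUI patterns to OCR text using only the standard library. The regexes are copied from the table; `extract_fields` and `parse_amount` are hypothetical helpers, and the amount parsing assumes Romanian number formatting (`.` as thousands separator, `,` as decimal separator). It does not reproduce the actual `ReceiptExtractor` or its confidence scoring.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Patterns taken from the table above.
AMOUNT_RE = re.compile(r"TOTAL\s*:?\s*([\d.,]+)", re.IGNORECASE)
DATE_RE = re.compile(r"(\d{2}[./]\d{2}[./]\d{4})")
CUI_RE = re.compile(r"C\.?U\.?I\.?\s*:?\s*(\d{6,10})", re.IGNORECASE)


def parse_amount(raw: str) -> Decimal:
    """Assumes Romanian formatting: '.' groups thousands, ',' marks decimals."""
    raw = raw.strip(".,")  # drop stray punctuation picked up by the regex
    if "," in raw:
        raw = raw.replace(".", "").replace(",", ".")
    return Decimal(raw)


def extract_fields(text: str) -> dict:
    """Return whichever of total, date, and cui can be found in the text."""
    fields = {}
    match = AMOUNT_RE.search(text)
    if match:
        try:
            fields["total"] = parse_amount(match.group(1))
        except InvalidOperation:
            pass  # captured only punctuation, not a number
    match = DATE_RE.search(text)
    if match:
        try:
            normalized = match.group(1).replace("/", ".")
            fields["date"] = datetime.strptime(normalized, "%d.%m.%Y").date()
        except ValueError:
            pass  # digits matched the shape but are not a real date
    match = CUI_RE.search(text)
    if match:
        fields["cui"] = match.group(1)
    return fields


sample = "EXEMPLU SRL\nC.U.I.: 12345678\nDATA: 12.12.2025\nTOTAL: 1.234,56 LEI"
print(extract_fields(sample))
```

Note that the amount pattern, as written, also matches inside `SUBTOTAL`, so a production extractor would likely need a word boundary or a preference for the last match on the receipt.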
+### OCR API Endpoints
+
+```
+GET  /api/ocr/status                      # Check OCR availability
+POST /api/ocr/extract                     # Extract from uploaded image
+POST /api/ocr/extract-attachment/{id}     # Re-process existing attachment
+```
+
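As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) requests for the status check and the re-process call with `urllib.request`. The base URL uses port 8003 from the architecture diagram; the helper names, the integer attachment id, and the optional bearer token are assumptions, since the diff does not document authentication or request/response schemas.

```python
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8003"  # assumption: FastAPI port from the diagram


def _with_auth(request: Request, token: Optional[str]) -> Request:
    # Assumption: if the API is JWT-protected, it expects a bearer token.
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    return request


def status_request(base_url: str = BASE_URL, token: Optional[str] = None) -> Request:
    """GET /api/ocr/status"""
    return _with_auth(Request(f"{base_url}/api/ocr/status", method="GET"), token)


def reprocess_request(attachment_id: int, base_url: str = BASE_URL,
                      token: Optional[str] = None) -> Request:
    """POST /api/ocr/extract-attachment/{id} with an empty body."""
    url = f"{base_url}/api/ocr/extract-attachment/{attachment_id}"
    return _with_auth(Request(url, data=b"", method="POST"), token)
```

Passing either request to `urllib.request.urlopen` would send it; the shape of the response is not specified here, so parsing is left out.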
+### System Dependencies
+
+```bash
+# Ubuntu/Debian
+apt-get install -y \
+    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
+    poppler-utils libgl1-mesa-glx libglib2.0-0
+```
+
 ## Testing Strategy

 ### Unit Tests
 - CRUD operations
 - Workflow transitions
 - Entry generation logic
+- OCR extraction patterns

 ### Integration Tests
 - API endpoints
 - File upload/download
 - Oracle nomenclature fetch
+- OCR endpoint with sample receipts

 ### E2E Tests
 - Complete workflow: create → submit → approve
 - File upload with preview
+- OCR extraction → form auto-fill