feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name, date, total amount, and VAT from uploaded receipt images/PDFs, reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions
--- a/docs/data-entry/ARCHITECTURE.md
+++ b/docs/data-entry/ARCHITECTURE.md
@@ -80,13 +80,14 @@ data/uploads/
 │  │   Vue.js     │     │   FastAPI    │     │   (staging)  │   │
 │  │   :3010      │     │   :8003      │     │              │   │
 │  └──────────────┘     └──────┬───────┘     └──────────────┘   │
-│                              │                                  │
-│                              │ Nomenclatures                    │
-│                              ▼                                  │
-│                       ┌──────────────┐                         │
-│                       │   Oracle     │                         │
-│                       │ (read-only)  │                         │
-│                       └──────────────┘                         │
+│        │                     │                                  │
+│        │ OCR Upload          │ Nomenclatures                    │
+│        ▼                     ▼                                  │
+│  ┌──────────────┐     ┌──────────────┐                         │
+│  │  OCR Service │     │   Oracle     │                         │
+│  │  PaddleOCR   │     │ (read-only)  │                         │
+│  │  +Tesseract  │     └──────────────┘                         │
+│  └──────────────┘                                               │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
 JWT_ALGORITHM=HS256
 ```

+## OCR Processing Pipeline
+
+### 5. OCR Architecture
+
+**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing
+
+**Rationale**:
+- No external costs (no cloud APIs)
+- On-premise processing (sensitive data stays local)
+- PaddleOCR: high accuracy, CPU-friendly
+- Tesseract: excellent support for Romanian diacritics
+
+**OCR stack**:
+```
+┌─────────────────────────────────────────────────────┐
+│                   OCR Pipeline                       │
+├─────────────────────────────────────────────────────┤
+│                                                      │
+│  Image Upload → ImagePreprocessor → OCREngine        │
+│       │              │                  │            │
+│       │              ▼                  ▼            │
+│       │         ┌─────────┐      ┌──────────────┐   │
+│       │         │ OpenCV  │      │ PaddleOCR    │   │
+│       │         │ Pipeline│      │ (primary)    │   │
+│       │         └─────────┘      └──────┬───────┘   │
+│       │              │                  │            │
+│       │              │           fallback│            │
+│       │              │                  ▼            │
+│       │              │           ┌──────────────┐   │
+│       │              │           │ Tesseract    │   │
+│       │              │           │ (ron+eng)    │   │
+│       │              │           └──────────────┘   │
+│       │              │                  │            │
+│       ▼              ▼                  ▼            │
+│  ┌──────────────────────────────────────────────┐   │
+│  │           ReceiptExtractor (Regex)           │   │
+│  │  - Amount patterns (TOTAL, DE PLATA)         │   │
+│  │  - Date patterns (DD.MM.YYYY)                │   │
+│  │  - CUI patterns (C.U.I., C.I.F.)             │   │
+│  │  - Vendor extraction (first lines)           │   │
+│  └──────────────────────────────────────────────┘   │
+│                        │                             │
+│                        ▼                             │
+│              ExtractionResult + Confidence           │
+│                                                      │
+└─────────────────────────────────────────────────────┘
+```
+
+### Image Preprocessing Pipeline
+
+```python
+def preprocess(image):
+    # Pseudocode outline: each step takes the output of the previous one.
+    # 1. Convert to grayscale
+    # 2. Upscale if width < 1000 px (small text is harder for OCR to read)
+    # 3. Deskew using Hough lines (straighten rotated text)
+    # 4. Denoise (non-local means denoising)
+    # 5. Adaptive thresholding (binarization)
+    # 6. Morphological close (reconnect broken characters)
+    return processed_image  # result of step 6
+```
+
+### Extraction Patterns (Romanian Receipts)
+
+| Pattern Type | Regex Examples | Confidence |
+|--------------|----------------|------------|
+| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
+| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
+| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
+| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
+| Vendor | First 5 non-keyword lines | 0.70 |
+
+### OCR API Endpoints
+
+```
+GET  /api/ocr/status                      # Check OCR availability
+POST /api/ocr/extract                     # Extract from uploaded image
+POST /api/ocr/extract-attachment/{id}     # Re-process existing attachment
+```
+
+### System Dependencies
+
+```bash
+# Ubuntu/Debian
+apt-get install -y \
+    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
+    poppler-utils libgl1-mesa-glx libglib2.0-0
+```
+
 ## Testing Strategy

 ### Unit Tests
 - CRUD operations
 - Workflow transitions
 - Entry generation logic
+- OCR extraction patterns

 ### Integration Tests
 - API endpoints
 - File upload/download
 - Oracle nomenclature fetch
+- OCR endpoint with sample receipts

 ### E2E Tests
 - Complete workflow: create → submit → approve
 - File upload with preview
+- OCR extraction → form auto-fill
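
The regex examples in the "Extraction Patterns" table of ARCHITECTURE.md can be tried directly with Python's `re` module. The sketch below is not the `ReceiptExtractor` from this commit (its code is not part of the diff shown here). It only copies the four patterns and confidence values from the table. The names `Field`, `extract_fields`, and `parse_amount`, the sample receipt text, and the rule that a comma marks the decimal separator are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Optional

# Patterns and confidence values copied verbatim from the table in ARCHITECTURE.md.
PATTERNS = {
    "amount": (re.compile(r"TOTAL\s*:?\s*([\d.,]+)"), 0.95),
    "date": (re.compile(r"(\d{2}[./]\d{2}[./]\d{4})"), 0.90),
    "cui": (re.compile(r"C\.?U\.?I\.?\s*:?\s*(\d{6,10})"), 0.95),
    "receipt_number": (re.compile(r"NR\.?\s*BON\s*:?\s*(\d+)"), 0.95),
}


@dataclass
class Field:
    """Hypothetical container for one extracted value."""
    value: str
    confidence: float


def extract_fields(text: str) -> Dict[str, Optional[Field]]:
    """Return the first match per field, or None when the pattern did not match."""
    result: Dict[str, Optional[Field]] = {}
    for name, (pattern, confidence) in PATTERNS.items():
        match = pattern.search(text)
        result[name] = Field(match.group(1), confidence) if match else None
    return result


def parse_amount(raw: str) -> Decimal:
    """Assumption: when a comma is present it is the decimal separator
    and dots are thousands separators (e.g. "1.234,56")."""
    if "," in raw:
        raw = raw.replace(".", "").replace(",", ".")
    return Decimal(raw)


# Made-up OCR output, not taken from a real receipt.
sample = "EXEMPLU SRL\nC.U.I.: 12345678\nNR. BON: 0042\n12.12.2025\nTOTAL: 45,90"
fields = extract_fields(sample)
total = parse_amount(fields["amount"].value)
```

Because `TOTAL` is matched without a word boundary, a line such as `SUBTOTAL` would also match; a stricter pattern may be worth considering if that shows up in real receipts.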
--- a/docs/data-entry/REQUIREMENTS.md
+++ b/docs/data-entry/REQUIREMENTS.md
@@ -3,6 +3,7 @@
 ## Objective

 Receipt entry system with:
+- **Automatic OCR** for extracting data from receipt photos (100% local, no costs)
 - **Photo upload** of receipts by users
 - **Automatic generation** of accounting entries (staging area)
 - **Accountant approval** before finalization
@@ -13,8 +14,10 @@ Receipt entry system with:

 ### 1. Receipt Management

-#### 1.1 Create Receipt
-- The user can upload a photo of the receipt
+#### 1.1 Create Receipt with OCR
+- The user uploads a photo of the receipt
+- **OCR automatically extracts**: amount, date, vendor, CUI, receipt number
+- The user reviews and corrects the extracted data
 - Required fields: document type, direction, date, amount, vendor, cash register/bank
 - Optional fields: receipt number, series, description
 - Document types: Bon Fiscal (fiscal receipt), Chitanta (payment receipt)
@@ -145,11 +148,71 @@ GET    /api/receipts/cash-registers      # Cash registers/Banks
 GET    /api/receipts/expense-types       # Expense types
 ```

+### OCR
+```
+GET    /api/ocr/status                   # Check OCR availability
+POST   /api/ocr/extract                  # Extract data from uploaded image
+POST   /api/ocr/extract-attachment/{id}  # Re-process existing attachment
+```
+
+## OCR - Technical Specifications
+
+### OCR Requirements
+- **100% local** - no external costs, no cloud APIs
+- **Full-auto** - fills in the form automatically
+- **Input**: images only (JPG/PNG), plus PDF files converted to images
+- **On-premise** - sensitive data stays local
+
+### Automatically Extracted Fields
+
+| Field | Type | Estimated Accuracy |
+|-------|------|--------------------|
+| Amount (TOTAL) | Decimal | 90-95% |
+| Receipt date | Date | 85-90% |
+| Receipt number | String | 80-85% |
+| Vendor | String | 70-80% |
+| CUI | String | 85-90% |
+| Document type | Enum | 95%+ |
+
+### OCR Technical Stack
+
+| Component | Solution | Rationale |
+|-----------|----------|-----------|
+| **OCR Engine** | PaddleOCR (primary) | 85-92% accuracy, pip install, CPU-friendly |
+| **Fallback OCR** | Tesseract + ron | Excellent support for Romanian diacritics |
+| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100ms) |
+| **Preprocessing** | OpenCV | Deskew, binarization, denoise |
+| **PDF → Image** | pdf2image + Poppler | Standard, reliable |
+
+### System Dependencies (Linux)
+
+```bash
+apt-get install -y \
+    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
+    poppler-utils libgl1-mesa-glx libglib2.0-0
+```
+
+### OCR User Flow
+
+```
+1. User opens "Bon Fiscal Nou" (New Fiscal Receipt)
+2. User drags in or selects the receipt photo
+3. User clicks "Proceseaza cu OCR" (Process with OCR)
+4. [Spinner 2-3 sec] "Se proceseaza imaginea..." (Processing image...)
+5. A preview appears with the extracted data + confidence indicators
+6. User clicks "Aplica datele" (Apply data) or corrects manually
+7. The form is filled in automatically
+8. User selects expense type and cash register
+9. User saves a draft or submits for approval
+```
+
 ## Success Criteria (Phase 1)

-- [ ] User can upload a receipt photo + basic data
-- [ ] System automatically generates accounting entries
-- [ ] Accountant can view, edit, and approve entries
-- [ ] Approved receipts are visible in the list
-- [ ] Alembic migrations work correctly
-- [ ] Receipt photos are saved and displayed correctly
+- [x] User can upload a receipt photo + basic data
+- [x] **OCR automatically extracts data from the receipt photo**
+- [x] **Confidence indicators for extracted data**
+- [x] System automatically generates accounting entries
+- [x] Accountant can view, edit, and approve entries
+- [x] Approved receipts are visible in the list
+- [x] Alembic migrations work correctly
+- [x] Receipt photos are saved and displayed correctly
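
The OCR stack table in REQUIREMENTS.md names PaddleOCR as the primary engine and Tesseract as the fallback, but the diff shown here does not include the code that switches between them. The following is a minimal sketch of one way such a fallback could work. It assumes each engine is wrapped in a callable that returns `(text, confidence)`, and that the fallback triggers when an engine raises or reports low confidence. `OCRResult`, `run_with_fallback`, the `0.6` threshold, and the two stub engines are hypothetical names and values, not part of the commit.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Assumption: an engine wrapper takes raw image bytes and returns (text, confidence).
Engine = Callable[[bytes], Tuple[str, float]]


@dataclass
class OCRResult:
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str        # which engine produced the text


def run_with_fallback(
    image: bytes,
    engines: Sequence[Tuple[str, Engine]],
    min_confidence: float = 0.6,  # hypothetical threshold
) -> Optional[OCRResult]:
    """Try engines in order. Return the first result that has text and meets
    the threshold; otherwise return the most confident result seen, or None
    if every engine failed."""
    best: Optional[OCRResult] = None
    for name, engine in engines:
        try:
            text, confidence = engine(image)
        except Exception:
            continue  # engine missing or crashed: move on to the next one
        result = OCRResult(text, confidence, name)
        if text.strip() and confidence >= min_confidence:
            return result
        if best is None or confidence > best.confidence:
            best = result
    return best


# Stub engines standing in for real PaddleOCR / Tesseract wrappers.
def fake_paddle(image: bytes) -> Tuple[str, float]:
    raise RuntimeError("primary engine unavailable")


def fake_tesseract(image: bytes) -> Tuple[str, float]:
    return ("TOTAL: 45,90", 0.82)


result = run_with_fallback(
    b"fake image bytes",
    [("paddleocr", fake_paddle), ("tesseract", fake_tesseract)],
)
```

Keeping the engines behind plain callables means the selection logic can be unit-tested without installing either OCR library, which fits the "OCR extraction patterns" unit tests listed in the testing strategy.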