feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name,
date, total amount, and VAT from uploaded receipt images/PDFs,
reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:48:29 +02:00
parent 5960154094
commit 41ae97180e
16 changed files with 2773 additions and 32 deletions


@@ -80,13 +80,14 @@ data/uploads/
│ │ Vue.js │ │ FastAPI │ │ (staging) │ │
│ │ :3010 │ │ :8003 │ │ │ │
│ └──────────────┘ └──────┬───────┘ └──────────────┘ │
│ │
│ Nomenclatoare │
▼ │
┌──────────────┐ │
│ Oracle │ │
│ (read-only) │ │
└──────────────┘ │
│ │
│ OCR Upload │ Nomenclatoare │
▼ │
┌──────────────┐ ┌──────────────┐ │
│ OCR Service │ │ Oracle │ │
│ PaddleOCR │ (read-only) │ │
│ +Tesseract └──────────────┘ │
│ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
JWT_ALGORITHM=HS256
```
## OCR Processing Pipeline
### 5. OCR Architecture
**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing
**Rationale**:
- Zero external costs (no cloud APIs)
- On-premise processing (sensitive data stays local)
- PaddleOCR: high accuracy, CPU-friendly
- Tesseract: excellent support for Romanian diacritics
**OCR stack**:
```
┌─────────────────────────────────────────────────────┐
│ OCR Pipeline │
├─────────────────────────────────────────────────────┤
│ │
│ Image Upload → ImagePreprocessor → OCREngine │
│ │ │ │ │
│ │ ▼ ▼ │
│ │ ┌─────────┐ ┌──────────────┐ │
│ │ │ OpenCV │ │ PaddleOCR │ │
│ │ │ Pipeline│ │ (primary) │ │
│ │ └─────────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ │ fallback│ │
│ │ │ ▼ │
│ │ │ ┌──────────────┐ │
│ │ │ │ Tesseract │ │
│ │ │ │ (ron+eng) │ │
│ │ │ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────┐ │
│ │ ReceiptExtractor (Regex) │ │
│ │ - Amount patterns (TOTAL, DE PLATA) │ │
│ │ - Date patterns (DD.MM.YYYY) │ │
│ │ - CUI patterns (C.U.I., C.I.F.) │ │
│ │ - Vendor extraction (first lines) │ │
│ └──────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ExtractionResult + Confidence │
│ │
└─────────────────────────────────────────────────────┘
```
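The primary/fallback arrangement in the diagram can be sketched roughly as follows. This is an illustration only, not the project's `OCREngine`: the `OcrResult` type, the `run_with_fallback` function, the callable-based engine interface, and the `min_confidence` default of 0.6 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine
    engine: str


# Hypothetical engine interface: takes image bytes and returns an OcrResult,
# or raises if the engine is unavailable or fails.
Engine = Callable[[bytes], OcrResult]


def run_with_fallback(image: bytes, engines: Sequence[Engine],
                      min_confidence: float = 0.6) -> Optional[OcrResult]:
    """Try engines in order and return the first result that is good enough.

    If no engine reaches min_confidence, return the best result seen,
    or None when every engine failed.
    """
    best: Optional[OcrResult] = None
    for engine in engines:
        try:
            result = engine(image)
        except Exception:
            continue  # engine missing or crashed: move on to the next one
        if result.text.strip() and result.confidence >= min_confidence:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    return best
```

Passing the engines as an ordered sequence keeps the primary/fallback order a matter of configuration, so a wrapper around PaddleOCR would come first and one around Tesseract second.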
### Image Preprocessing Pipeline
```python
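def preprocess(image):
    # Outline of the steps, in the order they are applied:
    # 1. Convert to grayscale
    # 2. Upscale if width < 1000 px (larger glyphs OCR better)
    # 3. Deskew using Hough lines (straighten rotated text)
    # 4. Denoise (non-local means)
    # 5. Adaptive thresholding (binarization)
    # 6. Morphological close (reconnect broken characters)
    processed_image = image  # placeholder: the steps above are not implemented in this outline
    return processed_image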
```
### Extraction Patterns (Romanian Receipts)
| Pattern Type | Regex Examples | Confidence |
|--------------|----------------|------------|
| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
| Vendor | First 5 non-keyword lines | 0.70 |
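
As a rough sketch of how the regex rows in this table could be applied to OCR text, the snippet below runs each pattern and attaches the confidence listed for it. The `PATTERNS` dictionary, the `extract_fields` function, the field keys, and the use of `re.IGNORECASE` are assumptions for illustration; the vendor heuristic (first non-keyword lines) is left out.

```python
import re
from typing import Dict, Optional, Tuple

# Patterns and confidence values copied from the table above.
# Case-insensitive matching is an assumption, not stated in the source.
PATTERNS: Dict[str, Tuple[re.Pattern, float]] = {
    "amount": (re.compile(r"TOTAL\s*:?\s*([\d.,]+)", re.IGNORECASE), 0.95),
    "date": (re.compile(r"(\d{2}[./]\d{2}[./]\d{4})"), 0.90),
    "cui": (re.compile(r"C\.?U\.?I\.?\s*:?\s*(\d{6,10})", re.IGNORECASE), 0.95),
    "receipt_number": (re.compile(r"NR\.?\s*BON\s*:?\s*(\d+)", re.IGNORECASE), 0.95),
}


def extract_fields(text: str) -> Dict[str, Optional[Tuple[str, float]]]:
    """Return {field: (raw_value, confidence)}, with None for fields not found."""
    found: Dict[str, Optional[Tuple[str, float]]] = {}
    for name, (pattern, confidence) in PATTERNS.items():
        match = pattern.search(text)  # first occurrence wins
        found[name] = (match.group(1), confidence) if match else None
    return found
```

Because `search` returns the first occurrence, a line such as `SUBTOTAL: 10,00` printed above the total would be picked up by the amount pattern as written; a real extractor would likely need a word boundary or a preference for the last match.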
### OCR API Endpoints
```
GET /api/ocr/status # Check OCR availability
POST /api/ocr/extract # Extract from uploaded image
POST /api/ocr/extract-attachment/{id} # Re-process existing attachment
```
### System Dependencies
```bash
# Ubuntu/Debian
apt-get install -y \
tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
poppler-utils libgl1-mesa-glx libglib2.0-0
```
## Testing Strategy
### Unit Tests
- CRUD operations
- Workflow transitions
- Entry generation logic
- OCR extraction patterns
### Integration Tests
- API endpoints
- File upload/download
- Oracle nomenclature fetch
- OCR endpoint with sample receipts
### E2E Tests
- Complete workflow: create → submit → approve
- File upload with preview
- OCR extraction → form auto-fill


@@ -3,6 +3,7 @@
## Objective
Fiscal receipt entry system with:
- **Automatic OCR** for extracting data from receipt photos (100% local, no costs)
- **Photo upload** of receipts by users
- **Automatic generation** of accounting entries (staging area)
- **Accountant approval** before finalization
@@ -13,8 +14,10 @@ Fiscal receipt entry system with:
### 1. Fiscal Receipt Management
#### 1.1 Create Receipt
- The user can upload a photo of the fiscal receipt
#### 1.1 Create Receipt with OCR
- The user uploads a photo of the fiscal receipt
- **OCR automatically extracts**: amount, date, vendor, CUI, receipt number
- The user reviews and corrects the extracted data
- Required fields: document type, direction, date, amount, vendor, cash register/bank
- Optional fields: receipt number, series, description
- Document types: Bon Fiscal (fiscal receipt), Chitanta (payment receipt)
@@ -145,11 +148,71 @@ GET /api/receipts/cash-registers # Cash registers/banks
GET /api/receipts/expense-types # Expense types
```
### OCR
```
GET /api/ocr/status # Check OCR availability
POST /api/ocr/extract # Extract data from uploaded image
POST /api/ocr/extract-attachment/{id} # Re-process existing attachment
```
## OCR - Technical Specifications
### OCR Requirements
- **100% local** - no external costs, no cloud APIs
- **Full-auto** - fills in the form automatically
- **Input**: images only (JPG/PNG/PDF)
- **On-premise** - sensitive data stays local
### Automatically Extracted Fields
| Field | Type | Estimated Accuracy |
|-------|------|--------------------|
| Amount (TOTAL) | Decimal | 90-95% |
| Receipt date | Date | 85-90% |
| Receipt number | String | 80-85% |
| Vendor | String | 70-80% |
| CUI | String | 85-90% |
| Document type | Enum | 95%+ |
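
The Decimal and Date types in the table above imply a normalization step between the raw OCR match and the form field. A minimal sketch of that step follows; the function names `parse_amount` and `parse_date` are hypothetical, and the rule that a separator followed by exactly two digits is the decimal mark is an assumption about how amounts are printed on receipts.

```python
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw: str) -> Optional[Decimal]:
    """Convert '1.234,50', '1234.50' or '45,90' to Decimal; None if unparseable."""
    cleaned = raw.strip().strip(".,")
    last_sep = max(cleaned.rfind(","), cleaned.rfind("."))
    if last_sep != -1 and len(cleaned) - last_sep - 1 == 2:
        # Assumption: a separator followed by exactly two digits is the decimal mark.
        whole = cleaned[:last_sep].replace(".", "").replace(",", "")
        cleaned = f"{whole}.{cleaned[last_sep + 1:]}"
    else:
        # Otherwise treat every separator as a thousands separator.
        cleaned = cleaned.replace(".", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def parse_date(raw: str) -> Optional[date]:
    """Parse DD.MM.YYYY or DD/MM/YYYY; None for impossible dates such as 31.02.2025."""
    try:
        return datetime.strptime(raw.replace("/", "."), "%d.%m.%Y").date()
    except ValueError:
        return None
```

Returning `None` instead of raising lets the caller leave the field empty for the user to fill in, which matches the review-and-correct step described in this document.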
### OCR Technical Stack
| Component | Solution | Justification |
|-----------|----------|---------------|
| **OCR Engine** | PaddleOCR (primary) | 85-92% accuracy, installable via pip, CPU-friendly |
| **Fallback OCR** | Tesseract + ron | Excellent support for Romanian diacritics |
| **Extraction** | Regex/rules-based | No extra dependencies, fast (<100 ms) |
| **Preprocessing** | OpenCV | Deskew, binarization, denoising |
| **PDF → Image** | pdf2image + Poppler | Standard, reliable |
### System Dependencies (Linux)
```bash
apt-get install -y \
tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
poppler-utils libgl1-mesa-glx libglib2.0-0
```
### OCR User Flow
```
1. User opens "New Fiscal Receipt"
2. User drags in or selects the receipt photo
3. User clicks "Process with OCR"
4. [Spinner 2-3 sec] "Processing image..."
5. A preview appears with the extracted data + confidence indicators
6. User clicks "Apply data" or corrects manually
7. The form is filled in automatically
8. User selects the expense type and cash register
9. User saves a draft or submits for approval
```
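
Step 5 of the flow shows confidence indicators next to the extracted data. One possible way to derive them from a per-field confidence score is sketched below; the source does not say how indicators are computed, so the `confidence_level` function, the level names, and both thresholds are assumptions.

```python
from typing import Optional

# Assumed cut-offs; the source does not specify them.
HIGH_THRESHOLD = 0.90
MEDIUM_THRESHOLD = 0.75


def confidence_level(confidence: Optional[float]) -> str:
    """Map an extraction confidence (0.0 to 1.0) to a coarse UI indicator."""
    if confidence is None:
        return "missing"  # field was not found; the user must fill it in
    if confidence >= HIGH_THRESHOLD:
        return "high"
    if confidence >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"
```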
## Success Criteria (Phase 1)
- [ ] User can upload a receipt photo + basic data
- [ ] System automatically generates accounting entries
- [ ] Accountant can view, edit, and approve entries
- [ ] Approved receipts are visible in the list
- [ ] Alembic migrations work correctly
- [ ] Receipt photos are saved and displayed correctly
- [x] User can upload a receipt photo + basic data
- [x] **OCR automatically extracts data from the receipt photo**
- [x] **Confidence indicators for extracted data**
- [x] System automatically generates accounting entries
- [x] Accountant can view, edit, and approve entries
- [x] Approved receipts are visible in the list
- [x] Alembic migrations work correctly
- [x] Receipt photos are saved and displayed correctly