feat: Add OCR integration for automatic receipt data extraction
Implement Tesseract-based OCR to automatically extract vendor name, date, total amount, and VAT from uploaded receipt images/PDFs, reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -80,13 +80,14 @@ data/uploads/
 │  │    Vue.js    │      │   FastAPI    │      │  (staging)   │   │
 │  │    :3010     │      │    :8003     │      │              │   │
 │  └──────────────┘      └──────┬───────┘      └──────────────┘   │
-│                               │                                 │
-│                               │ Nomenclatures                   │
-│                               ▼                                 │
-│                        ┌──────────────┐                         │
-│                        │    Oracle    │                         │
-│                        │ (read-only)  │                         │
-│                        └──────────────┘                         │
+│         │                     │                                 │
+│         │ OCR Upload          │ Nomenclatures                   │
+│         ▼                     ▼                                 │
+│  ┌──────────────┐      ┌──────────────┐                         │
+│  │ OCR Service  │      │    Oracle    │                         │
+│  │  PaddleOCR   │      │ (read-only)  │                         │
+│  │  +Tesseract  │      └──────────────┘                         │
+│  └──────────────┘                                               │
 │                                                                 │
 └─────────────────────────────────────────────────────────────────┘
 ```
@@ -258,18 +259,109 @@ JWT_SECRET_KEY=***
JWT_ALGORITHM=HS256
```

## OCR Processing Pipeline

### 5. OCR Architecture

**Choice**: PaddleOCR (primary) + Tesseract (fallback), 100% local processing

**Rationale**:
- Zero external costs (no cloud APIs)
- On-premise processing (sensitive data stays local)
- PaddleOCR: high accuracy, CPU-friendly
- Tesseract: excellent support for Romanian diacritics

**OCR stack**:
```
┌─────────────────────────────────────────────────────┐
│                    OCR Pipeline                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Image Upload → ImagePreprocessor → OCREngine       │
│       │                 │               │           │
│       │                 ▼               ▼           │
│       │            ┌─────────┐   ┌──────────────┐   │
│       │            │ OpenCV  │   │  PaddleOCR   │   │
│       │            │ Pipeline│   │  (primary)   │   │
│       │            └─────────┘   └──────┬───────┘   │
│       │                 │               │           │
│       │                 │       fallback│           │
│       │                 │               ▼           │
│       │                 │        ┌──────────────┐   │
│       │                 │        │  Tesseract   │   │
│       │                 │        │  (ron+eng)   │   │
│       │                 │        └──────────────┘   │
│       │                 │               │           │
│       ▼                 ▼               ▼           │
│  ┌──────────────────────────────────────────────┐   │
│  │           ReceiptExtractor (Regex)           │   │
│  │  - Amount patterns (TOTAL, DE PLATA)         │   │
│  │  - Date patterns (DD.MM.YYYY)                │   │
│  │  - CUI patterns (C.U.I., C.I.F.)             │   │
│  │  - Vendor extraction (first lines)           │   │
│  └──────────────────────────────────────────────┘   │
│                         │                           │
│                         ▼                           │
│           ExtractionResult + Confidence             │
│                                                     │
└─────────────────────────────────────────────────────┘
```
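
The primary/fallback arrangement in the diagram could be wired roughly as below. This is a minimal sketch, not the project's `OCREngine`: the names `OCRText`, `Engine`, `run_with_fallback`, and the `min_confidence` threshold are hypothetical, and the rule "fall back when the primary engine fails, returns no text, or reports low confidence" is an assumption, since the diagram only labels the edge "fallback". Engines are modelled as plain callables so that real PaddleOCR or Tesseract wrappers can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical engine interface: image bytes in, (text, confidence 0..1) out.
Engine = Callable[[bytes], Tuple[str, float]]


@dataclass
class OCRText:
    text: str
    confidence: float
    engine: str  # "primary" or "fallback"


def _score(result: OCRText) -> float:
    # A result without any text is treated as worthless, whatever it reports.
    return result.confidence if result.text.strip() else 0.0


def _try_engine(engine: Engine, image: bytes, label: str) -> Optional[OCRText]:
    try:
        text, confidence = engine(image)
    except Exception:
        return None  # engine missing or crashed: treat as "no result"
    return OCRText(text, confidence, label)


def run_with_fallback(
    image: bytes,
    primary: Engine,
    fallback: Engine,
    min_confidence: float = 0.6,  # assumed threshold, not from the source
) -> OCRText:
    """Use the primary engine; consult the fallback only when needed."""
    first = _try_engine(primary, image, "primary")
    if first is not None and _score(first) >= min_confidence:
        return first

    second = _try_engine(fallback, image, "fallback")
    candidates = [r for r in (first, second) if r is not None]
    if not candidates:
        raise RuntimeError("no OCR engine produced a result")
    # max() keeps the earlier candidate on ties, so the primary wins a draw.
    return max(candidates, key=_score)
```

With this shape, the fallback engine is never invoked for receipts the primary engine reads confidently, which keeps the common path to a single OCR pass.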

### Image Preprocessing Pipeline

```python
def preprocess(image):
    """Outline of the preprocessing steps applied before OCR (pseudocode)."""
    # 1. Convert to grayscale
    # 2. Resize if width < 1000px (upscale for better OCR)
    # 3. Deskew using Hough lines (straighten rotated text)
    # 4. Denoise (non-local means denoising)
    # 5. Adaptive thresholding (binarization)
    # 6. Morphological close (connect broken characters)
    processed_image = image  # placeholder: the steps above are not implemented in this outline
    return processed_image
```

### Extraction Patterns (Romanian Receipts)

| Pattern Type | Regex Examples | Confidence |
|--------------|----------------|------------|
| Amount | `TOTAL\s*:?\s*([\d.,]+)` | 0.95 |
| Date | `(\d{2}[./]\d{2}[./]\d{4})` | 0.90 |
| CUI | `C\.?U\.?I\.?\s*:?\s*(\d{6,10})` | 0.95 |
| Receipt Number | `NR\.?\s*BON\s*:?\s*(\d+)` | 0.95 |
| Vendor | First 5 non-keyword lines | 0.70 |
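
To illustrate how the regex rows of the table might be applied to OCR text, here is a small sketch using only the standard library. The patterns and confidence values are copied from the table; everything else (`Field`, `parse_amount`, `extract_fields`, and the rule that the last `.` or `,` in an amount is the decimal separator) is an assumption made for the example, not the actual `ReceiptExtractor`. The vendor heuristic is left out because its keyword list is not given here.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Dict, Union

# (compiled pattern, confidence) per field, taken from the table.
PATTERNS = {
    "amount": (re.compile(r"TOTAL\s*:?\s*([\d.,]+)"), 0.95),
    "date": (re.compile(r"(\d{2}[./]\d{2}[./]\d{4})"), 0.90),
    "cui": (re.compile(r"C\.?U\.?I\.?\s*:?\s*(\d{6,10})"), 0.95),
    "receipt_number": (re.compile(r"NR\.?\s*BON\s*:?\s*(\d+)"), 0.95),
}


@dataclass
class Field:
    value: Union[str, Decimal, date]
    confidence: float


def parse_amount(raw: str) -> Decimal:
    """Normalize '1.234,56' or '45.90' to a Decimal.

    Assumption: the last '.' or ',' is the decimal separator and any
    earlier separators are thousands grouping.
    """
    raw = raw.strip(".,")
    pos = max(raw.rfind("."), raw.rfind(","))
    if pos == -1:
        return Decimal(raw)
    integer_part = re.sub(r"[.,]", "", raw[:pos]) or "0"
    return Decimal(f"{integer_part}.{raw[pos + 1:]}")


def extract_fields(text: str) -> Dict[str, Field]:
    """Return the first match of each pattern, with its table confidence."""
    fields: Dict[str, Field] = {}
    for name, (pattern, confidence) in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue
        raw = match.group(1)
        try:
            if name == "amount":
                value = parse_amount(raw)
            elif name == "date":
                value = datetime.strptime(raw.replace("/", "."), "%d.%m.%Y").date()
            else:
                value = raw
        except (InvalidOperation, ValueError):
            continue  # matched text was not a valid number or calendar date
        fields[name] = Field(value, confidence)
    return fields
```

Calling `extract_fields(ocr_text)` on the text returned by the OCR step yields a dictionary keyed by field name; fields whose pattern does not match are simply absent, which lets the form auto-fill skip them.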

### OCR API Endpoints

```
GET  /api/ocr/status                    # Check OCR availability
POST /api/ocr/extract                   # Extract from uploaded image
POST /api/ocr/extract-attachment/{id}   # Re-process existing attachment
```
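
As a rough illustration of how a client might address these three endpoints, the sketch below builds the requests with the standard library without sending them. Several details are assumptions rather than facts from this document: the base URL (taken from the `:8003` FastAPI port in the architecture diagram), the use of `multipart/form-data` for the upload, the form field name `file`, and the helper names themselves. Authentication headers are omitted.

```python
import uuid
from urllib.request import Request

BASE_URL = "http://localhost:8003"  # assumption: FastAPI port from the diagram


def status_request() -> Request:
    """GET /api/ocr/status"""
    return Request(f"{BASE_URL}/api/ocr/status", method="GET")


def extract_request(
    filename: str,
    content: bytes,
    content_type: str = "image/jpeg",
    field: str = "file",  # hypothetical form field name
) -> Request:
    """POST /api/ocr/extract with the image as a multipart upload (assumed)."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join(
        [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{field}"; '
            f'filename="{filename}"'.encode(),
            f"Content-Type: {content_type}".encode(),
            b"",
            content,
            f"--{boundary}--".encode(),
            b"",
        ]
    )
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return Request(
        f"{BASE_URL}/api/ocr/extract", data=body, headers=headers, method="POST"
    )


def reprocess_request(attachment_id: int) -> Request:
    """POST /api/ocr/extract-attachment/{id} for an already stored file."""
    url = f"{BASE_URL}/api/ocr/extract-attachment/{attachment_id}"
    return Request(url, data=b"", method="POST")
```

Each helper returns a `urllib.request.Request`; passing it to `urllib.request.urlopen` would perform the call against a running backend.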

### System Dependencies

```bash
# Ubuntu/Debian
apt-get install -y \
    tesseract-ocr tesseract-ocr-ron tesseract-ocr-eng \
    poppler-utils libgl1-mesa-glx libglib2.0-0
```

## Testing Strategy

### Unit Tests
- CRUD operations
- Workflow transitions
- Entry generation logic
- OCR extraction patterns

### Integration Tests
- API endpoints
- File upload/download
- Oracle nomenclature fetch
- OCR endpoint with sample receipts

### E2E Tests
- Complete workflow: create → submit → approve
- File upload with preview
- OCR extraction → form auto-fill