feat: Add OCR integration for automatic receipt data extraction

Implement Tesseract-based OCR to automatically extract vendor name, date, total amount, and VAT from uploaded receipt images/PDFs, reducing manual data entry and improving accuracy.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,6 +1,6 @@
# Data Entry App - Fiscal Receipts

Application for entering fiscal receipts with an approval workflow.
Application for entering fiscal receipts with an approval workflow and automatic data extraction via OCR.

## Quick Start

@@ -10,7 +10,27 @@ Application for entering fiscal receipts with an approval workflow.
- Node.js 18+
- (Optional) SSH tunnel for the Oracle reference tables (nomenclatoare)

### Backend Setup
### Using Start Script (Recommended)

```bash
# Start all services
./start-data-entry.sh

# Or individual commands:
./start-data-entry.sh start # Start all
./start-data-entry.sh stop # Stop all
./start-data-entry.sh status # Check status
./start-data-entry.sh restart backend # Restart backend only
```

**Services:**
- Backend: http://localhost:8003
- Frontend: http://localhost:3010
- API Docs: http://localhost:8003/docs

### Manual Setup

#### Backend Setup

```bash
cd data-entry-app/backend
@@ -34,7 +54,7 @@ alembic upgrade head
uvicorn app.main:app --reload --port 8003
```

### Frontend Setup
#### Frontend Setup

```bash
cd data-entry-app/frontend
@@ -46,15 +66,10 @@ npm install
npm run dev -- --port 3010
```

### Access

- **Backend API**: http://localhost:8003
- **API Docs**: http://localhost:8003/docs
- **Frontend**: http://localhost:3010

## Features

### For Users
- **Automatic OCR** - Automatic data extraction from the receipt photo (amount, date, vendor, CUI tax ID)
- Upload photos of fiscal receipts
- Fill in receipt data (amount, date, vendor)
- Select the expense type
@@ -66,13 +81,75 @@ npm run dev -- --port 3010
- Approve/Reject receipts
- Bulk approval

## OCR Feature

### How it works

1. **Upload image** - Drag and drop or select the receipt photo
2. **OCR processing** - Click "Proceseaza cu OCR" (Process with OCR)
3. **Preview** - The extracted data is shown with confidence indicators
4. **Apply** - Click "Aplica datele in formular" (Apply data to form) to auto-fill the form

### Automatically extracted fields

| Field | Estimated accuracy |
|-------|--------------------|
| Amount (TOTAL) | 90-95% |
| Date | 85-90% |
| Receipt number | 80-85% |
| Vendor | 70-80% |
| CUI | 85-90% |
| Document type | 95%+ |
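
As a rough illustration of how fields like those in the table might be pulled out of raw OCR text, here is a minimal Python sketch. The regular expressions, the `extract_fields` name, and the sample receipt text are assumptions made for this example; the actual patterns in `ocr_extractor.py` are not shown in this diff and may differ.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns for Romanian receipts; not taken from ocr_extractor.py.
TOTAL_RE = re.compile(r"\bTOTAL\b\D{0,20}?(\d+[.,]\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")
CUI_RE = re.compile(r"\b(?:CUI|CIF)\s*:?\s*(?:RO)?\s*(\d{2,10})\b", re.IGNORECASE)


def extract_fields(text: str) -> dict:
    """Return whichever of total, date and CUI can be found in the OCR text."""
    fields = {}
    if m := TOTAL_RE.search(text):
        # Receipts usually print a decimal comma; normalise it for Decimal.
        fields["total"] = Decimal(m.group(1).replace(",", "."))
    if m := DATE_RE.search(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            fields["date"] = date(year, month, day)
        except ValueError:
            pass  # digits looked like a date but were not a valid one
    if m := CUI_RE.search(text):
        fields["cui"] = m.group(1)
    return fields


# Made-up receipt text, standing in for real OCR output.
sample = "SC EXEMPLU SRL\nCUI: RO12345678\nDATA: 15.03.2024\nTOTAL LEI 123,45"
print(extract_fields(sample))
```

Fields that cannot be matched are simply left out of the result, which keeps it easy for a caller to decide which form inputs to pre-fill and which to leave for manual entry.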

### OCR System Dependencies (Linux/Docker)

The following must be installed for OCR to work:

```bash
# Ubuntu/Debian
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0

# Fedora/RHEL
dnf install -y \
    tesseract \
    tesseract-langpack-ron \
    tesseract-langpack-eng \
    poppler-utils
```

**Note:** PaddleOCR (the primary engine) is installed automatically via pip. Tesseract is used as a fallback.
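
The primary/fallback arrangement described in the note could look roughly like the sketch below. It assumes an "engine" is any callable that takes image bytes and returns text; that signature, the `run_with_fallback` name, and the two stand-in engines are hypothetical and are not the interface of `ocr_engine.py`.

```python
from typing import Callable, Iterable, Optional

# Assumed shape of an engine: image bytes in, recognised text out.
Engine = Callable[[bytes], str]


def run_with_fallback(
    image: bytes, engines: Iterable[tuple[str, Engine]]
) -> tuple[Optional[str], str]:
    """Try each engine in order; return (name, text) for the first that yields text."""
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception:
            # A missing dependency or a runtime failure: move on to the next engine.
            continue
        if text.strip():
            return name, text
    return None, ""


# Stand-in engines for demonstration only.
def broken_primary(image: bytes) -> str:
    raise RuntimeError("primary engine unavailable")


def simple_fallback(image: bytes) -> str:
    return "TOTAL 10,00"


print(run_with_fallback(b"...", [("primary", broken_primary), ("fallback", simple_fallback)]))
```

Returning the engine name alongside the text lets the caller record which engine produced a result, which may be useful when comparing accuracy between the two.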

### OCR API Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/ocr/status | Check OCR service status |
| POST | /api/ocr/extract | Extract data from uploaded image |
| POST | /api/ocr/extract-attachment/{id} | Re-process existing attachment |

### Test OCR

```bash
# Check OCR status
curl http://localhost:8003/api/ocr/status

# Extract from image
curl -X POST -F "file=@bon.jpg" http://localhost:8003/api/ocr/extract
```
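
The same upload can be scripted from Python using only the standard library. This is a sketch under assumptions: the helper names are made up for the example, the form field is called `file` as in the curl command, and the endpoint is assumed to reply with JSON (the response schema is not shown in this diff).

```python
import json
import urllib.request
import uuid

BASE_URL = "http://localhost:8003"  # backend address used in the curl examples


def build_extract_request(
    filename: str, content: bytes, mime: str = "image/jpeg"
) -> urllib.request.Request:
    """Build a multipart/form-data POST for /api/ocr/extract with one 'file' part."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/ocr/extract",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the reply, assuming the API returns JSON."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the backend running, reading `bon.jpg` in binary mode and calling `send(build_extract_request("bon.jpg", data))` should be equivalent to the second curl command.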

## Workflow

```
DRAFT → PENDING_REVIEW → APPROVED/REJECTED → (SYNCED in Oracle)
```

1. **DRAFT**: The user fills in the data
1. **DRAFT**: The user fills in the data (manually or via OCR)
2. **PENDING_REVIEW**: The system generates the accounting entries automatically
3. **APPROVED**: The accountant has approved the receipt
4. **REJECTED**: The accountant has rejected it (the user can correct it)
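
One way to picture the workflow is as a small table of allowed status transitions. The sketch below is only an illustration: the `Status` enum and `can_transition` helper are hypothetical names, the REJECTED to DRAFT move is an assumption based on the remark that the user can correct a rejected receipt, and so is the reading that only APPROVED receipts are synced to Oracle.

```python
from enum import Enum


class Status(str, Enum):
    DRAFT = "DRAFT"
    PENDING_REVIEW = "PENDING_REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    SYNCED = "SYNCED"


# Allowed moves, read off the workflow diagram. REJECTED -> DRAFT and
# APPROVED -> SYNCED are assumptions; the real code may model them differently.
TRANSITIONS = {
    Status.DRAFT: {Status.PENDING_REVIEW},
    Status.PENDING_REVIEW: {Status.APPROVED, Status.REJECTED},
    Status.APPROVED: {Status.SYNCED},
    Status.REJECTED: {Status.DRAFT},
    Status.SYNCED: set(),
}


def can_transition(current: Status, target: Status) -> bool:
    """Return True if a receipt may move from `current` to `target`."""
    return target in TRANSITIONS[current]


print(can_transition(Status.DRAFT, Status.PENDING_REVIEW))
print(can_transition(Status.DRAFT, Status.APPROVED))
```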
@@ -90,8 +167,16 @@ data-entry-app/
│   │   │   ├── models/  # SQLModel models
│   │   │   └── crud/  # CRUD operations
│   │   ├── schemas/  # Pydantic schemas
│   │   ├── services/  # Business logic
│   │   └── routers/  # API endpoints
│   │   │   └── ocr.py  # OCR response schemas
│   │   ├── services/
│   │   │   ├── receipt_service.py
│   │   │   ├── ocr_service.py  # OCR orchestration
│   │   │   ├── ocr_engine.py  # PaddleOCR/Tesseract
│   │   │   ├── ocr_extractor.py  # Regex patterns RO
│   │   │   └── image_preprocessor.py  # OpenCV pipeline
│   │   └── routers/
│   │       ├── receipts.py
│   │       └── ocr.py  # OCR endpoints
│   ├── migrations/  # Alembic migrations
│   ├── data/
│   │   ├── receipts.db  # SQLite database
@@ -101,7 +186,12 @@ data-entry-app/
├── frontend/
│   ├── src/
│   │   ├── views/receipts/  # Page components
│   │   ├── components/receipts/  # Reusable components
│   │   ├── components/
│   │   │   ├── receipts/  # Receipt components
│   │   │   └── ocr/  # OCR components
│   │   │       ├── OCRUploadZone.vue
│   │   │       ├── OCRPreview.vue
│   │   │       └── OCRConfidenceIndicator.vue
│   │   ├── stores/  # Pinia stores
│   │   └── router/  # Vue Router
│   ├── package.json
@@ -169,6 +259,23 @@ Full API documentation available at http://localhost:8003/docs when backend is r
| POST | /api/receipts/{id}/approve | Approve receipt |
| POST | /api/receipts/{id}/reject | Reject receipt |
| POST | /api/receipts/{id}/attachments | Upload attachment |
| GET | /api/ocr/status | OCR service status |
| POST | /api/ocr/extract | OCR image extraction |

## Troubleshooting

### OCR not working

1. Check OCR status: `curl http://localhost:8003/api/ocr/status`
2. Install system dependencies (tesseract, poppler)
3. Verify PaddleOCR installed: `python -c "from paddleocr import PaddleOCR"`
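
Steps 2 and 3 can be combined into one quick check run from the backend's Python environment. This is a hedged sketch, not part of the project: the `check_ocr_dependencies` name is made up, and it only tests whether the `tesseract` and `pdftoppm` (from poppler-utils) binaries are on `PATH` and whether the `paddleocr` package can be located.

```python
import importlib.util
import shutil


def check_ocr_dependencies() -> dict[str, bool]:
    """Report which OCR prerequisites are visible from this Python environment."""
    return {
        # Binaries provided by the tesseract and poppler-utils system packages
        "tesseract": shutil.which("tesseract") is not None,
        "pdftoppm": shutil.which("pdftoppm") is not None,
        # Python package; find_spec locates it without importing it
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    }


for name, ok in check_ocr_dependencies().items():
    print(f"{name:10} {'ok' if ok else 'MISSING'}")
```

Note that this does not verify the Romanian language data for Tesseract; running `tesseract --list-langs` and looking for `ron` covers that.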

### Low OCR accuracy

- Ensure good lighting when taking receipt photos
- Keep receipt flat (no folds/wrinkles)
- Try PDF instead of JPG for scanned documents
- Check if text is in focus

## Phase 2 (Future)
