Data Entry App - Fiscal Receipts

Application for entering fiscal receipts, with an approval workflow and automatic data extraction via OCR.

Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 18+
  • (Optional) SSH tunnel for the Oracle nomenclatures

# Start all services
./start-data-entry.sh

# Or individual commands:
./start-data-entry.sh start              # Start all
./start-data-entry.sh stop               # Stop all
./start-data-entry.sh status             # Check status
./start-data-entry.sh restart backend    # Restart backend only

Services:

Manual Setup

Backend Setup

cd backend/modules/data_entry/backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Create .env file
cp .env.example .env
# Edit .env with your settings

# Run migrations
alembic upgrade head

# Start server
uvicorn app.main:app --reload --port 8003

Frontend Setup

cd backend/modules/data_entry/frontend

# Install dependencies
npm install

# Start dev server
npm run dev -- --port 3010

Features

For Users

  • Automatic OCR - Automatic data extraction from the receipt photo (amount, date, supplier, CUI)
  • Upload fiscal receipt photos
  • Fill in receipt data (amount, date, supplier)
  • Select the expense type
  • Submit for approval

For Accountants

  • View pending receipts
  • Edit proposed accounting entries
  • Approve/Reject receipts
  • Bulk approval

OCR Feature

How it works

  1. Upload image - Drag or select the receipt photo
  2. OCR processing - Click "Proceseaza cu OCR" (Process with OCR)
  3. Preview - The extracted data is displayed with confidence indicators
  4. Apply - Click "Aplica datele in formular" (Apply data to form) to auto-fill the form

Automatically extracted fields

Field             Estimated accuracy
Amount (TOTAL)    90-95%
Date              85-90%
Receipt number    80-85%
Supplier          70-80%
CUI               85-90%
Document type     95%+

OCR System Dependencies (Linux/Docker)

The following packages must be installed for OCR to work:

# Ubuntu/Debian
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0

# Fedora/RHEL
dnf install -y \
    tesseract \
    tesseract-langpack-ron \
    tesseract-langpack-eng \
    poppler-utils

Note: PaddleOCR (the primary engine) is installed automatically via pip. Tesseract is used as a fallback.
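
The primary/fallback arrangement described in the note can be sketched as follows. This is a minimal illustration, not the code in ocr_engine.py: the `OcrEngine` signature and the `run_with_fallback` helper are hypothetical names, and each engine is assumed to be a callable that takes image bytes and returns text.

```python
from typing import Callable, Sequence

# Hypothetical engine signature: takes image bytes, returns the recognized text.
OcrEngine = Callable[[bytes], str]


def run_with_fallback(
    image: bytes, engines: Sequence[tuple[str, OcrEngine]]
) -> tuple[str, str]:
    """Try each engine in order; return (engine_name, text) from the first success."""
    errors: list[str] = []
    for name, engine in engines:
        try:
            text = engine(image)
        except Exception as exc:
            # A missing or crashing engine should not abort the whole request.
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all OCR engines failed: " + "; ".join(errors))
```

Passing the engines as an ordered list keeps the preference (PaddleOCR first, Tesseract second) in one place and makes the helper easy to exercise with stub engines.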

OCR System Dependencies (Windows)

On Windows Server, the following components must be installed manually:

1. Poppler (for PDF → image conversion)

# Download Poppler for Windows
# https://github.com/osborn/poppler-windows/releases
# or https://github.com/bblanchon/pdfium-binaries

# Extract to C:\Program Files\poppler\
# Add to PATH: C:\Program Files\poppler\Library\bin

2. Tesseract OCR (backup OCR engine)

# Download the installer from:
# https://github.com/UB-Mannheim/tesseract/wiki

# Install with the languages: English + Romanian
# Default path: C:\Program Files\Tesseract-OCR\
# Add it to PATH

3. Python OCR Dependencies (in venv)

cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate

# Install the OCR dependencies
# (quote the specifiers so the shell does not treat ">" as output redirection)
pip install "paddlepaddle>=2.5.0"
pip install "paddleocr>=2.7.0"
pip install "opencv-python>=4.8.0"
pip install "pytesseract>=0.3.10"
pip install "pdf2image>=1.16.0"

# Or from requirements.txt
pip install -r requirements.txt

4. Restart the service

nssm restart ROA2WEB-DataEntry

Important notes for Windows:

  • On first run, PaddleOCR downloads its models (~200MB), which can take a few minutes
  • PaddleOCR needs ~2GB of available RAM
  • Verify the PATH entries for Poppler and Tesseract after installation
  • Restart the backend service after any PATH change

OCR API Endpoints

Method  Endpoint                            Description
GET     /api/ocr/status                     Check OCR service status
POST    /api/ocr/extract                    Extract data from uploaded image
POST    /api/ocr/extract-attachment/{id}    Re-process existing attachment

Test OCR

# Check OCR status
curl http://localhost:8003/api/ocr/status

# Extract from image
curl -X POST -F "file=@bon.jpg" http://localhost:8003/api/ocr/extract
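
The same upload can be made from Python with only the standard library. The sketch below mirrors the curl call: it sends one multipart form field named `file` to `/api/ocr/extract`. The helper names are hypothetical, and the assumption that the endpoint answers with JSON is not confirmed by this README.

```python
import json
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:8003"  # port from the uvicorn command in this README


def build_extract_request(
    filename: str, content: bytes, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the multipart/form-data POST that `curl -F "file=@..."` would send."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{base_url}/api/ocr/extract",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def extract(path: str) -> dict:
    """Upload a receipt image and decode the response (assumed to be JSON)."""
    with open(path, "rb") as fh:
        request = build_extract_request(os.path.basename(path), fh.read())
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

With the backend running, `extract("bon.jpg")` would perform the request; building the request separately makes it possible to inspect it without a server.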

Workflow

DRAFT → PENDING_REVIEW → APPROVED/REJECTED → (SYNCED in Oracle)
  1. DRAFT: The user fills in the data (manually or via OCR)
  2. PENDING_REVIEW: The system generates the accounting entries automatically
  3. APPROVED: The accountant approved the receipt
  4. REJECTED: The accountant rejected the receipt (the user can correct it)
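
The status flow can be expressed as a small transition table. This is an illustrative sketch, not the module's actual model code: `ReceiptStatus`, `ALLOWED`, and `transition` are hypothetical names, the REJECTED → DRAFT edge is an assumption drawn from "the user can correct it", and SYNCED is included only because it appears in the diagram.

```python
from enum import Enum


class ReceiptStatus(str, Enum):
    DRAFT = "DRAFT"
    PENDING_REVIEW = "PENDING_REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    SYNCED = "SYNCED"


# Allowed moves, read off the workflow line.
# REJECTED -> DRAFT is an assumption (the user corrects and resubmits).
ALLOWED = {
    ReceiptStatus.DRAFT: {ReceiptStatus.PENDING_REVIEW},
    ReceiptStatus.PENDING_REVIEW: {ReceiptStatus.APPROVED, ReceiptStatus.REJECTED},
    ReceiptStatus.REJECTED: {ReceiptStatus.DRAFT},
    ReceiptStatus.APPROVED: {ReceiptStatus.SYNCED},
    ReceiptStatus.SYNCED: set(),
}


def transition(current: ReceiptStatus, target: ReceiptStatus) -> ReceiptStatus:
    """Return the new status, or raise if the workflow does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move receipt from {current.value} to {target.value}")
    return target
```

Keeping the rules in a single dictionary means the submit, approve, and reject handlers can all call one validation function instead of repeating status checks.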

Project Structure

backend/modules/data_entry/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI entry point
│   │   ├── config.py            # Settings
│   │   ├── db/
│   │   │   ├── database.py      # SQLite engine
│   │   │   ├── models/          # SQLModel models
│   │   │   └── crud/            # CRUD operations
│   │   ├── schemas/             # Pydantic schemas
│   │   │   └── ocr.py           # OCR response schemas
│   │   ├── services/
│   │   │   ├── receipt_service.py
│   │   │   ├── ocr_service.py       # OCR orchestration
│   │   │   ├── ocr_engine.py        # PaddleOCR/Tesseract
│   │   │   ├── ocr_extractor.py     # Regex patterns RO
│   │   │   └── image_preprocessor.py # OpenCV pipeline
│   │   └── routers/
│   │       ├── receipts.py
│   │       └── ocr.py           # OCR endpoints
│   ├── migrations/              # Alembic migrations
│   ├── data/
│   │   ├── receipts.db          # SQLite database
│   │   └── uploads/             # Uploaded files
│   └── requirements.txt
│
├── frontend/
│   ├── src/
│   │   ├── views/receipts/      # Page components
│   │   ├── components/
│   │   │   ├── receipts/        # Receipt components
│   │   │   └── ocr/             # OCR components
│   │   │       ├── OCRUploadZone.vue
│   │   │       ├── OCRPreview.vue
│   │   │       └── OCRConfidenceIndicator.vue
│   │   ├── stores/              # Pinia stores
│   │   └── router/              # Vue Router
│   ├── package.json
│   └── vite.config.js
│
└── docs/                        # Documentation

Environment Variables

Backend (.env)

# SQLite
SQLITE_DATABASE_PATH=data/receipts.db

# File uploads
UPLOAD_PATH=data/uploads
MAX_UPLOAD_SIZE_MB=10

# Oracle (for nomenclatures)
ORACLE_USER=CONTAFIN_ORACLE
ORACLE_PASSWORD=your_password
ORACLE_HOST=localhost
ORACLE_PORT=1526
ORACLE_SID=ROA

# JWT (shared with Reports module)
JWT_SECRET_KEY=your_secret_key
JWT_ALGORITHM=HS256
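
As a rough illustration of how these variables might be consumed, the sketch below reads them into a typed object. It is not the project's config.py: the `Settings` class and `load_settings` function are hypothetical, and using the example values above as defaults is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    sqlite_database_path: str
    upload_path: str
    max_upload_size_bytes: int
    oracle_host: str
    oracle_port: int
    oracle_sid: str
    jwt_algorithm: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the backend variables; defaults mirror the example values (assumed)."""
    max_mb = int(env.get("MAX_UPLOAD_SIZE_MB", "10"))
    if max_mb <= 0:
        raise ValueError("MAX_UPLOAD_SIZE_MB must be positive")
    return Settings(
        sqlite_database_path=env.get("SQLITE_DATABASE_PATH", "data/receipts.db"),
        upload_path=env.get("UPLOAD_PATH", "data/uploads"),
        max_upload_size_bytes=max_mb * 1024 * 1024,  # convert MB to bytes once
        oracle_host=env.get("ORACLE_HOST", "localhost"),
        oracle_port=int(env.get("ORACLE_PORT", "1526")),
        oracle_sid=env.get("ORACLE_SID", "ROA"),
        jwt_algorithm=env.get("JWT_ALGORITHM", "HS256"),
    )
```

Secrets such as ORACLE_PASSWORD and JWT_SECRET_KEY are deliberately left out of the sketch; they should have no default and should fail loudly when missing.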

Development

Create new migration

cd backend
alembic revision --autogenerate -m "Add new field"
alembic upgrade head

Run tests

# Backend
cd backend && pytest

# Frontend
cd frontend && npm run test

API Documentation

Full API documentation available at http://localhost:8003/docs when backend is running.

Key Endpoints

Method  Endpoint                          Description
POST    /api/receipts/                    Create receipt
GET     /api/receipts/                    List receipts
GET     /api/receipts/{id}                Get receipt details
POST    /api/receipts/{id}/submit         Submit for review
POST    /api/receipts/{id}/approve        Approve receipt
POST    /api/receipts/{id}/reject         Reject receipt
POST    /api/receipts/{id}/attachments    Upload attachment
GET     /api/ocr/status                   OCR service status
POST    /api/ocr/extract                  OCR image extraction

Troubleshooting

OCR not working

  1. Check OCR status: curl http://localhost:8003/api/ocr/status
  2. Install system dependencies (tesseract, poppler)
  3. Verify PaddleOCR installed: python -c "from paddleocr import PaddleOCR"
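
The local checks in steps 2 and 3 can be scripted. The sketch below only reports what the current Python process can see: it looks for the `tesseract` and `pdfinfo` executables on PATH and checks that the `paddleocr` package is importable. The function name is hypothetical. Run it inside the backend's virtual environment, since a Windows service may see a different PATH than an interactive shell.

```python
import importlib.util
import shutil


def check_ocr_dependencies() -> dict[str, bool]:
    """Report which OCR prerequisites are visible to the current Python process."""
    return {
        # Executables must be on PATH for pytesseract and pdf2image to find them.
        "tesseract": shutil.which("tesseract") is not None,
        "poppler (pdfinfo)": shutil.which("pdfinfo") is not None,
        # find_spec locates the package without importing it (no model download).
        "paddleocr": importlib.util.find_spec("paddleocr") is not None,
    }


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(f"{name:20} {'OK' if found else 'MISSING'}")
```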

OCR Windows - "poppler not in PATH"

# Error: "Unable to get page count. Is poppler installed and in PATH?"

# Step 1: Add Poppler to PATH
# System Properties → Environment Variables → System variables → Path → New
# Add: C:\Program Files\poppler\Library\bin

# Step 2: Restart the service so it picks up the new PATH
nssm restart ROA2WEB-DataEntry

# Verify:
pdfinfo -v

OCR Windows - "tesseract not found"

# Error: "tesseract is not installed or it's not in your PATH"

# Solution: Add Tesseract to PATH
# C:\Program Files\Tesseract-OCR\

# Verify:
tesseract --version
tesseract --list-langs  # Must list 'ron' and 'eng'

OCR Windows - PaddleOCR import error

# Error: "No module named 'paddleocr'"

cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
# Quote the specifiers so the shell does not treat ">" as output redirection
pip install "paddlepaddle>=2.5.0"
pip install "paddleocr>=2.7.0"

# Restart the service
nssm restart ROA2WEB-DataEntry

Low OCR accuracy

  • Ensure good lighting when taking receipt photos
  • Keep receipt flat (no folds/wrinkles)
  • Try PDF instead of JPG for scanned documents
  • Check if text is in focus

Phase 2 (Future)

  • Oracle sync for approved receipts
  • Integration with pack_contafin procedures
  • Automatic posting to ACT/RUL tables