Files

Claude Agent 28f259cd05 fix(ocr): Fix store profile extraction patterns and module loading

Major fixes to OCR store profiles for Romanian receipt extraction:

- Fix ProfileRegistry module path resolution (was loading 0 profiles)
- Add multiline TVA extraction for Brick, Electrobering, Gama Ink
- Add "CARTE CREDIT" payment detection for OMV/SOCAR gas stations
- Handle OCR artifacts: TVA→TUA, "-"→"4", I→L in CUI markers
- Add client CUI patterns for Brick receipts
- Add profile selection logging to ocr_extractor.py
- Create test script for all 29 PDFs (test_all_profiles.py)

Test results: 13/29 passing (improved from 9/29)
Remaining failures are primarily OCR quality issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-07 09:40:58 +00:00

abonament kineterra.pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

ARCHITECTURE.md

chore: Remove obsolete microservices directories and update all references

2025-12-30 12:08:20 +02:00

benzina 07 aug. 2024.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

benzina 10 mai 2025.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

benzina 13 iulie.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

benzina 13 septembrie .pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

benzina 14 august.pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

benzina 20 dec.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

benzina 27 octombrie .pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

best print stampila .pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

bon fiscal Dedeman - efactura.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick consumabil 604 50% deductibil 22 dec.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick consumabile 604 22 dec.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick igiena 1 sept.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick igiena 8 octombrie 98.95 lei card.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick igiena 604.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

brick igiena, electrice consumabile 604.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

DATA-ENTRY-MODULE.md

feat: Migrate to ultrathin monolith architecture

2025-12-29 23:48:14 +02:00

electrobering igiena iulie 604.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

electrobering telecomanda.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

factura 70005116259 20.09.2025 Dedeman.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

gama ink refill toner imprimanta 17 sept 2024.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

igiena 11 octombrie .pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

igiena 14 decembrie five-holding.pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

kineterra abonament terapie august 2024.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

kineterra fizioterapie 9 sept.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

OCR_PROFILE_TEST_RESULTS.md

fix(ocr): Fix store profile extraction patterns and module loading

2026-01-07 09:40:58 +00:00

README.md

chore: Remove obsolete microservices directories and update all references

2025-12-30 12:08:20 +02:00

rechizite 12 decembrie pictus.pdf

docs(data-entry): Add sample receipt PDFs for OCR testing

2025-12-30 22:04:35 +02:00

REQUIREMENTS.md

chore: Remove obsolete microservices directories and update all references

2025-12-30 12:08:20 +02:00

stepout market carti tva 5%.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

unlimited duplicat chei 23 mai.pdf

feat(ocr): Add docTR OCR engine with metrics infrastructure

2026-01-02 05:37:16 +02:00

README.md

Data Entry App - Fiscal Receipts (Bonuri Fiscale)

Application for entering fiscal receipts, with an approval workflow and automatic data extraction via OCR.

Quick Start

Prerequisites

Python 3.10+
Node.js 18+
(Optional) SSH tunnel for the Oracle nomenclatures

Using Start Script (Recommended)

# Start all services
./start-data-entry.sh

# Or individual commands:
./start-data-entry.sh start              # Start all
./start-data-entry.sh stop               # Stop all
./start-data-entry.sh status             # Check status
./start-data-entry.sh restart backend    # Restart backend only

Services:

Manual Setup

Backend Setup

cd backend/modules/data_entry/backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Create .env file
cp .env.example .env
# Edit .env with your settings

# Run migrations
alembic upgrade head

# Start server
uvicorn app.main:app --reload --port 8003

Frontend Setup

cd backend/modules/data_entry/frontend

# Install dependencies
npm install

# Start dev server
npm run dev -- --port 3010

Features

For Users

Automatic OCR - Automatic data extraction from the receipt photo (amount, date, supplier, CUI)
Upload photos of fiscal receipts
Fill in receipt data (amount, date, supplier)
Select the expense type
Submit for approval

For Accountants

View pending receipts
Edit the proposed accounting entries
Approve/Reject receipts
Bulk approval

OCR Feature

How it works

Upload image - Drag and drop or select the receipt photo
OCR processing - Click "Proceseaza cu OCR" (Process with OCR)
Preview - The extracted data is shown with confidence indicators
Apply - Click "Aplica datele in formular" (Apply data to form) to auto-fill the form

Automatically extracted fields

Field	Estimated accuracy
Amount (TOTAL)	90-95%
Date	85-90%
Receipt number	80-85%
Supplier	70-80%
CUI	85-90%
Document type	95%+
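
As a rough illustration of how fields like these can be pulled out of raw OCR text, here is a minimal regex sketch. The patterns, the `extract_fields` helper and the sample receipt text are all hypothetical and are not the patterns used in `ocr_extractor.py`; real receipts need store-specific handling and tolerance for OCR artifacts.

```python
import re
from datetime import date
from decimal import Decimal

# Illustrative patterns only (assumptions, not the project's actual regexes).
TOTAL_RE = re.compile(r"\bTOTAL\b[^\d\n]*(\d+[.,]\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{2})[./-](\d{2})[./-](\d{4})\b")
CUI_RE = re.compile(r"\b(?:CUI|CIF)\b\s*:?\s*(?:RO)?\s*(\d{2,10})\b", re.IGNORECASE)


def extract_fields(text: str) -> dict:
    """Return whichever of amount, date and CUI can be found in OCR text."""
    fields = {}

    match = TOTAL_RE.search(text)
    if match:
        # Romanian receipts usually use a decimal comma; normalise to a dot.
        fields["amount"] = Decimal(match.group(1).replace(",", "."))

    match = DATE_RE.search(text)
    if match:
        day, month, year = map(int, match.groups())
        try:
            fields["date"] = date(year, month, day)
        except ValueError:
            pass  # digits looked like a date but are not a valid calendar day

    match = CUI_RE.search(text)
    if match:
        fields["cui"] = match.group(1)

    return fields


# Made-up receipt text, used only to exercise the patterns above.
SAMPLE_TEXT = (
    "S.C. EXEMPLU S.R.L.\n"
    "CIF: RO12345678\n"
    "PAINE 2 x 4,50\n"
    "TOTAL LEI 98,95\n"
    "DATA: 08.10.2025 ORA: 12:30"
)
```

Calling `extract_fields(SAMPLE_TEXT)` yields the amount as a `Decimal`, the date as a `datetime.date` and the CUI as a digit string. Each pattern takes the first match only, which is one reason per-field accuracy on real receipts stays below 100%.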

OCR System Dependencies (Linux/Docker)

OCR requires the following system packages:

# Ubuntu/Debian
apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-ron \
    tesseract-ocr-eng \
    poppler-utils \
    libgl1-mesa-glx \
    libglib2.0-0

# Fedora/RHEL
dnf install -y \
    tesseract \
    tesseract-langpack-ron \
    tesseract-langpack-eng \
    poppler-utils

Note: PaddleOCR (the primary engine) is installed automatically with pip. Tesseract is used as a fallback.

OCR System Dependencies (Windows)

On Windows Server the following components must be installed manually:

1. Poppler (for PDF → image conversion)

# Download Poppler for Windows
# https://github.com/osborn/poppler-windows/releases
# or https://github.com/bblanchon/pdfium-binaries

# Extract to C:\Program Files\poppler\
# Add to PATH: C:\Program Files\poppler\Library\bin

2. Tesseract OCR (backup OCR engine)

# Download the installer from:
# https://github.com/UB-Mannheim/tesseract/wiki

# Install with the languages: English + Romanian
# Default path: C:\Program Files\Tesseract-OCR\
# Add it to PATH

3. Python OCR Dependencies (in venv)

cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate

# Install the OCR dependencies
# (quote the specifiers: an unquoted ">" is treated by the shell as output redirection)
pip install "paddlepaddle>=2.5.0"
pip install "paddleocr>=2.7.0"
pip install "opencv-python>=4.8.0"
pip install "pytesseract>=0.3.10"
pip install "pdf2image>=1.16.0"

# Or from requirements.txt
pip install -r requirements.txt

4. Restart the service

nssm restart ROA2WEB-DataEntry

Important notes for Windows:

On first run PaddleOCR downloads its models (~200MB), which can take a few minutes
PaddleOCR needs ~2GB of available RAM
Check the PATH entries for Poppler and Tesseract after installation
Restart the backend service after any PATH change
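
The PATH check mentioned above can also be scripted. The following is a minimal sketch using only the standard library; the `REQUIRED_TOOLS` mapping and the `find_missing_tools` helper are hypothetical names, and the list of executables assumes that `pdf2image` relies on Poppler's `pdfinfo` and `pdftoppm` and that `pytesseract` calls the `tesseract` binary.

```python
import shutil

# Executable name -> component that provides it (assumed list, adjust as needed).
REQUIRED_TOOLS = {
    "pdfinfo": "Poppler",
    "pdftoppm": "Poppler",
    "tesseract": "Tesseract OCR",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the sorted names of required executables not found on PATH."""
    return sorted(name for name in tools if which(name) is None)


def report() -> None:
    """Print a short human-readable summary of the PATH check."""
    missing = find_missing_tools()
    if not missing:
        print("All OCR system tools were found on PATH.")
        return
    for name in missing:
        print(f"Missing: {name} (part of {REQUIRED_TOOLS[name]})")
```

Run `report()` from the same account and environment that the backend service uses, since a service may see a different PATH than an interactive shell.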

OCR API Endpoints

Method	Endpoint	Description
GET	/api/ocr/status	Check OCR service status
POST	/api/ocr/extract	Extract data from uploaded image
POST	/api/ocr/extract-attachment/{id}	Re-process existing attachment

Test OCR

# Check OCR status
curl http://localhost:8003/api/ocr/status

# Extract from image
curl -X POST -F "file=@bon.jpg" http://localhost:8003/api/ocr/extract
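
The same two calls can be made from Python. This is a sketch using only the standard library; it assumes the endpoints return JSON and that the upload field is named `file`, as in the curl command above. The helper names (`build_multipart`, `ocr_status`, `ocr_extract`) are hypothetical, and any authentication the API may require is not handled.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

BASE_URL = "http://localhost:8003"  # port used in the setup instructions


def build_multipart(field: str, filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def ocr_status() -> dict:
    """GET /api/ocr/status and decode the (assumed) JSON response."""
    with urllib.request.urlopen(f"{BASE_URL}/api/ocr/status", timeout=10) as resp:
        return json.load(resp)


def ocr_extract(image_path: str) -> dict:
    """POST an image to /api/ocr/extract and decode the (assumed) JSON response."""
    path = Path(image_path)
    body, content_type = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{BASE_URL}/api/ocr/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # OCR can be slow, especially on first run, so allow a generous timeout.
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)
```

With the backend running, `ocr_status()` followed by `ocr_extract("bon.jpg")` mirrors the two curl commands.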

Workflow

DRAFT → PENDING_REVIEW → APPROVED/REJECTED → (SYNCED in Oracle)

DRAFT: The user fills in the data (manually or via OCR)
PENDING_REVIEW: The system generates the accounting entries automatically
APPROVED: The accountant has approved the receipt
REJECTED: The accountant has rejected the receipt (the user can correct it)
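
The workflow above can be read as a small state machine. The sketch below is one possible encoding, not the project's actual model: the `ReceiptStatus` enum and the helper names are hypothetical, and the REJECTED → DRAFT transition is an assumption based on the note that the user can correct a rejected receipt.

```python
from enum import Enum


class ReceiptStatus(str, Enum):
    DRAFT = "DRAFT"
    PENDING_REVIEW = "PENDING_REVIEW"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    SYNCED = "SYNCED"


# Allowed transitions. REJECTED -> DRAFT is an assumption (user corrects and
# resubmits); APPROVED -> SYNCED corresponds to the future Oracle sync step.
ALLOWED_TRANSITIONS = {
    ReceiptStatus.DRAFT: {ReceiptStatus.PENDING_REVIEW},
    ReceiptStatus.PENDING_REVIEW: {ReceiptStatus.APPROVED, ReceiptStatus.REJECTED},
    ReceiptStatus.REJECTED: {ReceiptStatus.DRAFT},
    ReceiptStatus.APPROVED: {ReceiptStatus.SYNCED},
    ReceiptStatus.SYNCED: set(),
}


def can_transition(current: ReceiptStatus, target: ReceiptStatus) -> bool:
    """Return True if the workflow allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def transition(current: ReceiptStatus, target: ReceiptStatus) -> ReceiptStatus:
    """Return the new status, or raise ValueError for an illegal move."""
    if not can_transition(current, target):
        raise ValueError(f"Illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the transitions in one table makes it straightforward for endpoints such as submit, approve and reject to validate a status change before persisting it.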

Project Structure

backend/modules/data_entry/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI entry point
│   │   ├── config.py            # Settings
│   │   ├── db/
│   │   │   ├── database.py      # SQLite engine
│   │   │   ├── models/          # SQLModel models
│   │   │   └── crud/            # CRUD operations
│   │   ├── schemas/             # Pydantic schemas
│   │   │   └── ocr.py           # OCR response schemas
│   │   ├── services/
│   │   │   ├── receipt_service.py
│   │   │   ├── ocr_service.py       # OCR orchestration
│   │   │   ├── ocr_engine.py        # PaddleOCR/Tesseract
│   │   │   ├── ocr_extractor.py     # Regex patterns RO
│   │   │   └── image_preprocessor.py # OpenCV pipeline
│   │   └── routers/
│   │       ├── receipts.py
│   │       └── ocr.py           # OCR endpoints
│   ├── migrations/              # Alembic migrations
│   ├── data/
│   │   ├── receipts.db          # SQLite database
│   │   └── uploads/             # Uploaded files
│   └── requirements.txt
│
├── frontend/
│   ├── src/
│   │   ├── views/receipts/      # Page components
│   │   ├── components/
│   │   │   ├── receipts/        # Receipt components
│   │   │   └── ocr/             # OCR components
│   │   │       ├── OCRUploadZone.vue
│   │   │       ├── OCRPreview.vue
│   │   │       └── OCRConfidenceIndicator.vue
│   │   ├── stores/              # Pinia stores
│   │   └── router/              # Vue Router
│   ├── package.json
│   └── vite.config.js
│
└── docs/                        # Documentation

Environment Variables

Backend (.env)

# SQLite
SQLITE_DATABASE_PATH=data/receipts.db

# File uploads
UPLOAD_PATH=data/uploads
MAX_UPLOAD_SIZE_MB=10

# Oracle (for nomenclatures)
ORACLE_USER=CONTAFIN_ORACLE
ORACLE_PASSWORD=your_password
ORACLE_HOST=localhost
ORACLE_PORT=1526
ORACLE_SID=ROA

# JWT (shared with Reports module)
JWT_SECRET_KEY=your_secret_key
JWT_ALGORITHM=HS256
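
For orientation, here is one way such variables could be read in Python. This is a sketch under stated assumptions rather than the contents of `config.py`: the `Settings` dataclass and `load_settings` function are hypothetical, the defaults simply mirror the sample values above, and the code expects the `.env` values to already be present in the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    sqlite_database_path: str
    upload_path: str
    max_upload_size_mb: int
    oracle_host: str
    oracle_port: int
    oracle_sid: str
    jwt_secret_key: str
    jwt_algorithm: str

    @property
    def max_upload_size_bytes(self) -> int:
        """Upload limit converted from megabytes to bytes."""
        return self.max_upload_size_mb * 1024 * 1024


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        sqlite_database_path=env.get("SQLITE_DATABASE_PATH", "data/receipts.db"),
        upload_path=env.get("UPLOAD_PATH", "data/uploads"),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "10")),
        oracle_host=env.get("ORACLE_HOST", "localhost"),
        oracle_port=int(env.get("ORACLE_PORT", "1526")),
        oracle_sid=env.get("ORACLE_SID", "ROA"),
        # No default for the secret: a missing key raises KeyError at startup.
        jwt_secret_key=env["JWT_SECRET_KEY"],
        jwt_algorithm=env.get("JWT_ALGORITHM", "HS256"),
    )
```

Failing fast on a missing `JWT_SECRET_KEY` is a design choice of this sketch; the Oracle credentials are left out for brevity and could be handled the same way.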

Development

Create new migration

cd backend
alembic revision --autogenerate -m "Add new field"
alembic upgrade head

Run tests

# Backend
cd backend && pytest

# Frontend
cd frontend && npm run test

API Documentation

Full API documentation available at http://localhost:8003/docs when backend is running.

Key Endpoints

Method	Endpoint	Description
POST	/api/receipts/	Create receipt
GET	/api/receipts/	List receipts
GET	/api/receipts/{id}	Get receipt details
POST	/api/receipts/{id}/submit	Submit for review
POST	/api/receipts/{id}/approve	Approve receipt
POST	/api/receipts/{id}/reject	Reject receipt
POST	/api/receipts/{id}/attachments	Upload attachment
GET	/api/ocr/status	OCR service status
POST	/api/ocr/extract	OCR image extraction

Troubleshooting

OCR not working

Check OCR status: curl http://localhost:8003/api/ocr/status
Install system dependencies (tesseract, poppler)
Verify PaddleOCR installed: python -c "from paddleocr import PaddleOCR"

OCR Windows - "poppler not in PATH"

# Error: "Unable to get page count. Is poppler installed and in PATH?"

# Solution 1: Add Poppler to PATH
# System Properties → Environment Variables → System variables → Path → New
# Add: C:\Program Files\poppler\Library\bin

# Solution 2: Restart the service after changing PATH
nssm restart ROA2WEB-DataEntry

# Verify:
pdfinfo --version

OCR Windows - "tesseract not found"

# Error: "tesseract is not installed or it's not in your PATH"

# Solution: Add Tesseract to PATH
# C:\Program Files\Tesseract-OCR\

# Verify:
tesseract --version
tesseract --list-langs  # Should list 'ron' and 'eng'

OCR Windows - PaddleOCR import error

# Error: "No module named 'paddleocr'"

cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
# Quote the specifiers: an unquoted ">" is treated by the shell as output redirection
pip install "paddlepaddle>=2.5.0"
pip install "paddleocr>=2.7.0"

# Restart the service
nssm restart ROA2WEB-DataEntry

Low OCR accuracy

Ensure good lighting when taking receipt photos
Keep receipt flat (no folds/wrinkles)
Try PDF instead of JPG for scanned documents
Check if text is in focus

Phase 2 (Future)

Oracle sync for approved receipts
Integration with pack_contafin procedures
Automatic posting to ACT/RUL tables