refactor(docs): consolidate and cleanup documentation

- Delete 9 deprecated/obsolete docs (~6,300 lines removed)
- Move test PDFs to tests/fixtures/ocr-samples/
- Create docs/DEPLOYMENT.md as principal guide
- Create tests/ocr-validation/README.md
- Update all refs for ultrathin monolith architecture
- Update OCR tests to use relative paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Claude Agent
2026-01-22 09:14:51 +00:00
parent 1b9ebf1d8f
commit 62f86250cc
55 changed files with 604 additions and 6334 deletions

File diff suppressed because it is too large Load Diff

View File

@@ -1,382 +0,0 @@
# Data Entry App - Bonuri Fiscale
Aplicatie pentru introducere bonuri fiscale cu workflow de aprobare si extragere automata date prin OCR.
## Quick Start
### Prerequisites
- Python 3.10+
- Node.js 18+
- (Optional) SSH tunnel pentru Oracle nomenclatoare
### Using Start Script (Recommended)
```bash
# Start all services
./start-data-entry.sh
# Or individual commands:
./start-data-entry.sh start # Start all
./start-data-entry.sh stop # Stop all
./start-data-entry.sh status # Check status
./start-data-entry.sh restart backend # Restart backend only
```
**Services:**
- Backend: http://localhost:8003
- Frontend: http://localhost:3010
- API Docs: http://localhost:8003/docs
### Manual Setup
#### Backend Setup
```bash
cd backend/modules/data_entry/backend
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# sau: venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Create .env file
cp .env.example .env
# Edit .env with your settings
# Run migrations
alembic upgrade head
# Start server
uvicorn app.main:app --reload --port 8003
```
#### Frontend Setup
```bash
cd backend/modules/data_entry/frontend
# Install dependencies
npm install
# Start dev server
npm run dev -- --port 3010
```
## Features
### Pentru Utilizatori
- **OCR Automat** - Extragere automata date din poza bonului (suma, data, furnizor, CUI)
- Upload poze bonuri fiscale
- Completare date bon (suma, data, furnizor)
- Selectie tip cheltuiala
- Trimitere spre aprobare
### Pentru Contabili
- Vizualizare bonuri in asteptare
- Editare note contabile propuse
- Aprobare/Respingere bonuri
- Aprobare in masa
## OCR Feature
### Cum functioneaza
1. **Upload imagine** - Trage sau selecteaza poza bonului
2. **Procesare OCR** - Click pe "Proceseaza cu OCR"
3. **Previzualizare** - Datele extrase sunt afisate cu indicatori de incredere
4. **Aplicare** - Click "Aplica datele in formular" pentru auto-fill
### Campuri extrase automat
| Camp | Acuratete estimata |
|------|-------------------|
| Suma (TOTAL) | 90-95% |
| Data | 85-90% |
| Numar bon | 80-85% |
| Furnizor | 70-80% |
| CUI | 85-90% |
| Tip document | 95%+ |
### OCR System Dependencies (Linux/Docker)
Pentru functionarea OCR trebuie instalate:
```bash
# Ubuntu/Debian
apt-get install -y \
tesseract-ocr \
tesseract-ocr-ron \
tesseract-ocr-eng \
poppler-utils \
libgl1-mesa-glx \
libglib2.0-0
# Fedora/RHEL
dnf install -y \
tesseract \
tesseract-langpack-ron \
tesseract-langpack-eng \
poppler-utils
```
**Note:** PaddleOCR (engine principal) se instaleaza automat cu pip. Tesseract este folosit ca fallback.
### OCR System Dependencies (Windows)
Pe Windows Server trebuie instalate manual urmatoarele componente:
#### 1. Poppler (pentru conversie PDF → imagini)
```powershell
# Descarca Poppler pentru Windows
# https://github.com/osborn/poppler-windows/releases
# sau https://github.com/bblanchon/pdfium-binaries
# Extrage in C:\Program Files\poppler\
# Adauga la PATH: C:\Program Files\poppler\Library\bin
```
#### 2. Tesseract OCR (engine OCR backup)
```powershell
# Descarca installer de la:
# https://github.com/UB-Mannheim/tesseract/wiki
# Instaleaza cu limbile: English + Romanian
# Default path: C:\Program Files\Tesseract-OCR\
# Adauga la PATH
```
#### 3. Python OCR Dependencies (in venv)
```powershell
cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
# Instaleaza dependentele OCR
pip install paddlepaddle>=2.5.0
pip install paddleocr>=2.7.0
pip install opencv-python>=4.8.0
pip install pytesseract>=0.3.10
pip install pdf2image>=1.16.0
# Sau din requirements.txt
pip install -r requirements.txt
```
#### 4. Restart serviciu
```powershell
nssm restart ROA2WEB-DataEntry
```
**Note importante Windows:**
- Prima rulare PaddleOCR descarca modele (~200MB) - poate dura cateva minute
- PaddleOCR necesita ~2GB RAM disponibil
- Verifica PATH-ul pentru Poppler si Tesseract dupa instalare
- Restart serviciul backend dupa orice modificare PATH
### OCR API Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | /api/ocr/status | Check OCR service status |
| POST | /api/ocr/extract | Extract data from uploaded image |
| POST | /api/ocr/extract-attachment/{id} | Re-process existing attachment |
### Test OCR
```bash
# Check OCR status
curl http://localhost:8003/api/ocr/status
# Extract from image
curl -X POST -F "file=@bon.jpg" http://localhost:8003/api/ocr/extract
```
## Workflow
```
DRAFT → PENDING_REVIEW → APPROVED/REJECTED → (SYNCED in Oracle)
```
1. **DRAFT**: Utilizator completeaza datele (manual sau via OCR)
2. **PENDING_REVIEW**: Sistemul genereaza note contabile automat
3. **APPROVED**: Contabil a aprobat bonul
4. **REJECTED**: Contabil a respins (utilizatorul poate corecta)
## Project Structure
```
backend/modules/data_entry/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI entry point
│ │ ├── config.py # Settings
│ │ ├── db/
│ │ │ ├── database.py # SQLite engine
│ │ │ ├── models/ # SQLModel models
│ │ │ └── crud/ # CRUD operations
│ │ ├── schemas/ # Pydantic schemas
│ │ │ └── ocr.py # OCR response schemas
│ │ ├── services/
│ │ │ ├── receipt_service.py
│ │ │ ├── ocr_service.py # OCR orchestration
│ │ │ ├── ocr_engine.py # PaddleOCR/Tesseract
│ │ │ ├── ocr_extractor.py # Regex patterns RO
│ │ │ └── image_preprocessor.py # OpenCV pipeline
│ │ └── routers/
│ │ ├── receipts.py
│ │ └── ocr.py # OCR endpoints
│ ├── migrations/ # Alembic migrations
│ ├── data/
│ │ ├── receipts.db # SQLite database
│ │ └── uploads/ # Uploaded files
│ └── requirements.txt
├── frontend/
│ ├── src/
│ │ ├── views/receipts/ # Page components
│ │ ├── components/
│ │ │ ├── receipts/ # Receipt components
│ │ │ └── ocr/ # OCR components
│ │ │ ├── OCRUploadZone.vue
│ │ │ ├── OCRPreview.vue
│ │ │ └── OCRConfidenceIndicator.vue
│ │ ├── stores/ # Pinia stores
│ │ └── router/ # Vue Router
│ ├── package.json
│ └── vite.config.js
└── docs/ # Documentation
```
## Environment Variables
### Backend (.env)
```bash
# SQLite
SQLITE_DATABASE_PATH=data/receipts.db
# File uploads
UPLOAD_PATH=data/uploads
MAX_UPLOAD_SIZE_MB=10
# Oracle (for nomenclatures)
ORACLE_USER=CONTAFIN_ORACLE
ORACLE_PASSWORD=your_password
ORACLE_HOST=localhost
ORACLE_PORT=1526
ORACLE_SID=ROA
# JWT (shared with Reports module)
JWT_SECRET_KEY=your_secret_key
JWT_ALGORITHM=HS256
```
## Development
### Create new migration
```bash
cd backend
alembic revision --autogenerate -m "Add new field"
alembic upgrade head
```
### Run tests
```bash
# Backend
cd backend && pytest
# Frontend
cd frontend && npm run test
```
## API Documentation
Full API documentation available at http://localhost:8003/docs when backend is running.
### Key Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /api/receipts/ | Create receipt |
| GET | /api/receipts/ | List receipts |
| GET | /api/receipts/{id} | Get receipt details |
| POST | /api/receipts/{id}/submit | Submit for review |
| POST | /api/receipts/{id}/approve | Approve receipt |
| POST | /api/receipts/{id}/reject | Reject receipt |
| POST | /api/receipts/{id}/attachments | Upload attachment |
| GET | /api/ocr/status | OCR service status |
| POST | /api/ocr/extract | OCR image extraction |
## Troubleshooting
### OCR not working
1. Check OCR status: `curl http://localhost:8003/api/ocr/status`
2. Install system dependencies (tesseract, poppler)
3. Verify PaddleOCR installed: `python -c "from paddleocr import PaddleOCR"`
### OCR Windows - "poppler not in PATH"
```powershell
# Eroare: "Unable to get page count. Is poppler installed and in PATH?"
# Solutie 1: Adauga Poppler la PATH
# System Properties → Environment Variables → System variables → Path → New
# Adauga: C:\Program Files\poppler\Library\bin
# Solutie 2: Restart serviciul dupa modificarea PATH
nssm restart ROA2WEB-DataEntry
# Verificare:
pdfinfo --version
```
### OCR Windows - "tesseract not found"
```powershell
# Eroare: "tesseract is not installed or it's not in your PATH"
# Solutie: Adauga Tesseract la PATH
# C:\Program Files\Tesseract-OCR\
# Verificare:
tesseract --version
tesseract --list-langs # Trebuie sa arate 'ron' si 'eng'
```
### OCR Windows - PaddleOCR import error
```powershell
# Eroare: "No module named 'paddleocr'"
cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
pip install paddlepaddle>=2.5.0
pip install paddleocr>=2.7.0
# Restart serviciu
nssm restart ROA2WEB-DataEntry
```
### Low OCR accuracy
- Ensure good lighting when taking receipt photos
- Keep receipt flat (no folds/wrinkles)
- Try PDF instead of JPG for scanned documents
- Check if text is in focus
## Phase 2 (Future)
- Oracle sync for approved receipts
- Integration with pack_contafin procedures
- Automatic posting to ACT/RUL tables

Binary file not shown.

Binary file not shown.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Binary file not shown.

Before

Width:  |  Height:  |  Size: 299 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 323 KiB