# OCR Validation Tests

Tests that validate the accuracy of OCR extraction from fiscal receipts.

---

## Prerequisites

1. **The backend must be running** at `http://localhost:8000`
2. **The Data Entry module must be enabled** in `.env`: `MODULE_DATA_ENTRY_ENABLED=true`
3. **`JWT_SECRET_KEY`** must be set (or use the test default)

```bash
# Start the backend
cd /workspace/roa2web
./start.sh prod
# or
./start.sh test
```
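
A quick pre-flight check can rule out the most common setup problems before any test runs. The sketch below is not part of the test suite: the helper names `parse_env` and `backend_reachable` are hypothetical, and it assumes `.env` uses plain `KEY=value` lines.

```python
import socket


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def backend_reachable(host: str = "localhost", port: int = 8000,
                      timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `parse_env(open(".env").read()).get("MODULE_DATA_ENTRY_ENABLED") == "true"` checks the module flag, and `backend_reachable()` checks that something is listening on port 8000. It does not confirm that the listener is actually the backend.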

---

## Test Files

| File | Purpose |
|------|---------|
| `expected_receipts.json` | Expected values for each receipt (ground truth) |
| `ocr-direct-validation.py` | Individual test with detailed comparison |
| `test_receipts_sequential.py` | Runs all receipts sequentially |
| `test_receipts_parallel.py` | Runs all receipts in parallel (performance test) |
| `test_receipts_parallel_windows.py` | Windows version with memory tracking |
| `get_raw_ocr_text.py` | Debug tool that prints the raw OCR text |

**Fixtures:** `tests/fixtures/ocr-samples/` contains 30 fiscal receipt PDFs

---

## How to Run the Tests

### 1. Individual Test (Recommended for debugging)

```bash
cd /workspace/roa2web

# Test all receipts with the doctr_plus engine
python tests/ocr-validation/ocr-direct-validation.py

# Test with a specific engine
python tests/ocr-validation/ocr-direct-validation.py --engine doctr_plus
python tests/ocr-validation/ocr-direct-validation.py --engine tesseract

# Test a single receipt
python tests/ocr-validation/ocr-direct-validation.py --receipt receipt_01

# Also include multi-page receipts
python tests/ocr-validation/ocr-direct-validation.py --include-multipage
```

### 2. Sequential Test (All receipts, one at a time)

```bash
python tests/ocr-validation/test_receipts_sequential.py
```

Output:
```
Processing: abonament kineterra.pdf
  ✓ Total: MATCH (1900.0 = 1900.0)
  ✓ Date: MATCH (2025-11-10)
  ✗ CUI: MISMATCH (expected: 31180432, got: 3118043)
```
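
The MATCH/MISMATCH lines come from comparing each extracted field with its ground-truth value. The sketch below shows one way such a check could work. It is an illustration, not the script's actual code: `compare_field` is a hypothetical helper, the 0.01 tolerance for numeric fields is an assumption, and the message format is simplified.

```python
import math


def compare_field(name, expected, actual, tolerance=0.01):
    """Return (ok, message) for one field.

    Numbers are compared within an absolute tolerance (assumed value);
    everything else is compared as an exact string.
    """
    numeric = (int, float)
    if isinstance(expected, numeric) and isinstance(actual, numeric):
        ok = math.isclose(expected, actual, abs_tol=tolerance)
    else:
        ok = str(expected) == str(actual)
    if ok:
        return True, f"✓ {name}: MATCH ({actual})"
    return False, f"✗ {name}: MISMATCH (expected: {expected}, got: {actual})"


ok, line = compare_field("CUI", "31180432", "3118043")
print(line)  # ✗ CUI: MISMATCH (expected: 31180432, got: 3118043)
```

Comparing the CUI as a string rather than a number keeps a dropped digit, as in the sample output, visible as a mismatch.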

### 3. Parallel Test (Performance benchmark)

```bash
python tests/ocr-validation/test_receipts_parallel.py
```

Output:
```
PARALLEL TEST: 26 receipts
Phase 1: Submitting all jobs...
Submitted 26 jobs in 2.3s
Phase 2: Waiting for results...
  OK: abonament kineterra.pdf                   12.3s  conf=95%
  OK: benzina 14 august.pdf                      8.7s  conf=92%
TOTAL TIME: 45.2s
```
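
The two phases in this output (submit everything, then wait) can be sketched with a thread pool. This is a stand-in for illustration only: the real script talks to the backend, whereas the hypothetical `run_parallel` below accepts any callable as `process`, and `max_workers=8` is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(filenames, process, max_workers=8):
    """Submit every file first, then collect results as they finish.

    Returns {filename: (result_or_exception, seconds)}, where seconds is
    measured from the start of the whole batch.
    """
    results = {}
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: submit all jobs without waiting for any of them.
        futures = {pool.submit(process, name): name for name in filenames}
        # Phase 2: wait for results in completion order.
        for future in as_completed(futures):
            name = futures[future]
            elapsed = time.perf_counter() - start
            try:
                results[name] = (future.result(), elapsed)
            except Exception as exc:  # one failed receipt should not stop the rest
                results[name] = (exc, elapsed)
    return results
```

Storing the exception instead of raising it lets the run finish and report every receipt, which matters when the goal is a benchmark across the whole fixture set.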

### 4. Debug Raw OCR Text

```bash
# Show the raw text extracted by OCR
python tests/ocr-validation/get_raw_ocr_text.py

# Or for a specific file
python tests/ocr-validation/get_raw_ocr_text.py tests/fixtures/ocr-samples/benzina\ 14\ august.pdf
```

---

## Expected Receipts Format

`expected_receipts.json` contains the ground truth for each receipt:

```json
{
  "receipts": [
    {
      "id": "receipt_01",
      "filename": "abonament kineterra.pdf",
      "furnizor": "KINETERRA CONCEPT SRL",
      "cui_furnizor": "31180432",
      "total": 1900.0,
      "tva_details": [],
      "total_tva": 0.0,
      "data_bon": "2025-11-10",
      "notes": "Neplatitor TVA - abonament terapie"
    }
  ]
}
```
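
A malformed entry in this file shows up later as a confusing test failure, so it can help to check the file on its own first. The sketch below is hypothetical and not part of the suite. It assumes every key from the example except `notes` is required, which is a guess based on that single entry.

```python
import json

# Assumed required keys, taken from the example entry ("notes" treated as optional).
REQUIRED_KEYS = {
    "id", "filename", "furnizor", "cui_furnizor",
    "total", "tva_details", "total_tva", "data_bon",
}


def validate_receipts(raw_json: str) -> list[str]:
    """Return a list of problems found in the ground-truth data (empty if none)."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(json.loads(raw_json).get("receipts", [])):
        label = entry.get("id", f"entry #{index}")
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
    return problems
```

Calling `validate_receipts(Path("tests/ocr-validation/expected_receipts.json").read_text())` would return an empty list for a well-formed file. The path is inferred from the commands above and may differ in your checkout.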

---

## Adding New Receipts for Testing

1. Put the PDF in `tests/fixtures/ocr-samples/`
2. Add an entry to `expected_receipts.json` with the correct values
3. Run the test:
   ```bash
   python tests/ocr-validation/ocr-direct-validation.py --receipt receipt_XX
   ```
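
The fixture folder and the JSON file are updated in separate steps, so one can end up without the other. The hypothetical helper below (not part of the suite) lists the gaps in both directions.

```python
def coverage_gaps(pdf_names, expected_filenames):
    """Compare fixture PDFs with ground-truth filenames.

    Returns (pdfs_without_entry, entries_without_pdf), both sorted.
    """
    pdfs = set(pdf_names)
    expected = set(expected_filenames)
    return sorted(pdfs - expected), sorted(expected - pdfs)
```

In practice the first argument could be `[p.name for p in Path("tests/fixtures/ocr-samples").glob("*.pdf")]` and the second the `filename` values read from `expected_receipts.json`.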

---

## Troubleshooting

### "Connection refused" or "Failed to connect"
- The backend is not running. Start it with `./start.sh prod`

### "401 Unauthorized"
- The JWT token is invalid. Check `JWT_SECRET_KEY` in `.env`

### "File not found"
- Check that the PDFs are in `tests/fixtures/ocr-samples/`

### Incorrect results
- Use `get_raw_ocr_text.py` to see what text the OCR extracts
- Check that the receipt is legible and the scan is of good quality

---

## Performance Notes

- **doctr_plus** engine: ~8-15 seconds per receipt (GPU accelerated)
- **tesseract** engine: ~3-5 seconds per receipt (CPU only)
- The parallel test can process ~26 receipts in ~45 seconds (vs ~5 minutes sequentially)
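
The sequential figure follows from the per-receipt timings. A short back-of-the-envelope check, using only the approximate numbers listed above (`sequential_estimate` is a hypothetical helper):

```python
def sequential_estimate(receipts: int, seconds_low: float, seconds_high: float):
    """Return (low, high) total minutes for processing receipts one at a time."""
    return receipts * seconds_low / 60, receipts * seconds_high / 60


low, high = sequential_estimate(26, 8, 15)
print(f"sequential: {low:.1f}-{high:.1f} min")  # sequential: 3.5-6.5 min

speedup = (5 * 60) / 45  # ~5 minutes sequential vs ~45 seconds parallel
print(f"speedup: ~{speedup:.1f}x")  # speedup: ~6.7x
```

The 3.5 to 6.5 minute range for doctr_plus brackets the quoted ~5 minutes, so the two figures are consistent.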