refactor(docs): consolidate and cleanup documentation

- Delete 9 deprecated/obsolete docs (~6,300 lines removed)
- Move test PDFs to tests/fixtures/ocr-samples/
- Create docs/DEPLOYMENT.md as principal guide
- Create tests/ocr-validation/README.md
- Update all refs for ultrathin monolith architecture
- Update OCR tests to use relative paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 09:14:51 +00:00
parent 1b9ebf1d8f
commit 62f86250cc
55 changed files with 604 additions and 6334 deletions
--- a/tests/ocr-validation/README.md
+++ b/tests/ocr-validation/README.md
@@ -0,0 +1,158 @@
+# OCR Validation Tests
+
+Tests that validate the accuracy of OCR extraction from fiscal receipts.
+
+---
+
+## Prerequisites
+
+1. **The backend must be running** at `http://localhost:8000`
+2. **Data Entry module enabled** in `.env`: `MODULE_DATA_ENTRY_ENABLED=true`
+3. **JWT_SECRET_KEY** set (or use the test default)
+
+```bash
+# Start the backend
+cd /workspace/roa2web
+./start-prod.sh
+# or
+./start-test.sh
+```
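Before running any test, the first prerequisite can be checked from Python. This is a minimal sketch, assuming only that the backend listens on TCP port 8000 of `localhost` as stated above; the `backend_is_listening` helper is hypothetical and is not part of the test suite.

```python
import socket


def backend_is_listening(host: str = "localhost", port: int = 8000,
                         timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if backend_is_listening():
        print("Backend reachable on localhost:8000")
    else:
        print("Backend not reachable; start it with ./start-prod.sh")
```

A successful TCP connection only shows that something is listening on the port; it does not confirm that the Data Entry module is enabled or that the JWT configuration is valid.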
+
+---
+
+## Test Files
+
+| File | Purpose |
+|------|---------|
+| `expected_receipts.json` | Expected values for each receipt (ground truth) |
+| `ocr-direct-validation.py` | Individual test with detailed comparison |
+| `test_receipts_sequential.py` | Runs all receipts sequentially |
+| `test_receipts_parallel.py` | Runs all receipts in parallel (performance test) |
+| `test_receipts_parallel_windows.py` | Windows version with memory tracking |
+| `get_raw_ocr_text.py` | Debug tool: prints the raw OCR text |
+
+**Fixtures:** `tests/fixtures/ocr-samples/` - 30 fiscal receipt PDFs
+
+---
+
+## How to run the tests
+
+### 1. Individual Test (recommended for debugging)
+
+```bash
+cd /workspace/roa2web
+
+# Test all receipts with the doctr_plus engine
+python tests/ocr-validation/ocr-direct-validation.py
+
+# Test with a specific engine
+python tests/ocr-validation/ocr-direct-validation.py --engine doctr_plus
+python tests/ocr-validation/ocr-direct-validation.py --engine tesseract
+
+# Test only a specific receipt
+python tests/ocr-validation/ocr-direct-validation.py --receipt receipt_01
+
+# Also include multi-page receipts
+python tests/ocr-validation/ocr-direct-validation.py --include-multipage
+```
+
+### 2. Sequential Test (all receipts, one at a time)
+
+```bash
+python tests/ocr-validation/test_receipts_sequential.py
+```
+
+Output:
+```
+Processing: abonament kineterra.pdf
+  ✓ Total: MATCH (1900.0 = 1900.0)
+  ✓ Date: MATCH (2025-11-10)
+  ✗ CUI: MISMATCH (expected: 31180432, got: 3118043)
+```
+
+### 3. Parallel Test (performance benchmark)
+
+```bash
+python tests/ocr-validation/test_receipts_parallel.py
+```
+
+Output:
+```
+PARALLEL TEST: 26 receipts
+Phase 1: Submitting all jobs...
+Submitted 26 jobs in 2.3s
+Phase 2: Waiting for results...
+  OK: abonament kineterra.pdf                   12.3s  conf=95%
+  OK: benzina 14 august.pdf                      8.7s  conf=92%
+TOTAL TIME: 45.2s
+```
+
+### 4. Debug Raw OCR Text
+
+```bash
+# View the raw text extracted by OCR
+python tests/ocr-validation/get_raw_ocr_text.py
+
+# Or for a specific file
+python tests/ocr-validation/get_raw_ocr_text.py tests/fixtures/ocr-samples/benzina\ 14\ august.pdf
+```
+
+---
+
+## Expected Receipts Format
+
+`expected_receipts.json` contains the ground truth for each receipt:
+
+```json
+{
+  "receipts": [
+    {
+      "id": "receipt_01",
+      "filename": "abonament kineterra.pdf",
+      "furnizor": "KINETERRA CONCEPT SRL",
+      "cui_furnizor": "31180432",
+      "total": 1900.0,
+      "tva_details": [],
+      "total_tva": 0.0,
+      "data_bon": "2025-11-10",
+      "notes": "Neplatitor TVA - abonament terapie"
+    }
+  ]
+}
+```
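The following is a minimal sketch of how one ground-truth entry might be compared against an OCR result, producing MATCH/MISMATCH lines similar to the sequential test output. The `compare_receipt` helper, the shape of the `ocr_result` dict, and the 0.01 tolerance on `total` are assumptions made for illustration; the actual comparison logic lives in `ocr-direct-validation.py` and is not shown in this README. Field names follow the JSON above (`furnizor` is the supplier, `cui_furnizor` the supplier's tax ID, `data_bon` the receipt date).

```python
import json
import math

# Fields checked in this sketch; names follow expected_receipts.json.
FIELDS = ("total", "data_bon", "cui_furnizor")


def compare_receipt(expected: dict, extracted: dict, tol: float = 0.01) -> dict:
    """Return {field: True/False} for each checked field (hypothetical helper)."""
    results = {}
    for field in FIELDS:
        want, got = expected.get(field), extracted.get(field)
        if isinstance(want, float) and isinstance(got, (int, float)):
            # Amounts are compared with a small absolute tolerance (assumed value).
            results[field] = math.isclose(want, got, abs_tol=tol)
        else:
            # Dates and tax IDs are compared as exact strings.
            results[field] = want == got
    return results


ground_truth = json.loads("""
{"receipts": [{"id": "receipt_01", "cui_furnizor": "31180432",
               "total": 1900.0, "data_bon": "2025-11-10"}]}
""")

# Hypothetical OCR result in which the last digit of the CUI was dropped.
ocr_result = {"total": 1900.0, "data_bon": "2025-11-10", "cui_furnizor": "3118043"}

report = compare_receipt(ground_truth["receipts"][0], ocr_result)
for field, ok in report.items():
    print(f"{'MATCH' if ok else 'MISMATCH'}: {field}")
```

Keeping `cui_furnizor` as a string in the ground truth means a dropped digit shows up as an exact-match failure rather than being hidden by numeric rounding.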
+
+---
+
+## Adding new receipts for testing
+
+1. Put the PDF in `tests/fixtures/ocr-samples/`
+2. Add an entry to `expected_receipts.json` with the correct values
+3. Run the test:
+   ```bash
+   python tests/ocr-validation/ocr-direct-validation.py --receipt receipt_XX
+   ```
+
+---
+
+## Troubleshooting
+
+### "Connection refused" or "Failed to connect"
+- The backend is not running. Start it with `./start-prod.sh`
+
+### "401 Unauthorized"
+- Invalid JWT token. Check `JWT_SECRET_KEY` in `.env`
+
+### "File not found"
+- Check that the PDFs are in `tests/fixtures/ocr-samples/`
+
+### Incorrect results
+- Use `get_raw_ocr_text.py` to see what text the OCR extracts
+- Check that the receipt is legible and of good quality
+
+---
+
+## Performance Notes
+
+- **doctr_plus** engine: ~8-15 seconds per receipt (GPU accelerated)
+- **tesseract** engine: ~3-5 seconds per receipt (CPU only)
+- The parallel test can process ~26 receipts in ~45 seconds (vs ~5 minutes sequentially)
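The gap between sequential and parallel timings comes from submitting every job first and only then waiting for results. The sketch below illustrates that pattern with Python's `concurrent.futures`; it is not the code of `test_receipts_parallel.py`. The `fake_ocr_job` function is a stand-in that sleeps instead of calling the backend, and the file names and worker count are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_ocr_job(filename: str, seconds: float = 0.05) -> str:
    """Stand-in for one OCR request; sleeps instead of calling the backend."""
    time.sleep(seconds)
    return filename


def run_sequential(files: list[str]) -> float:
    """Process files one at a time and return the elapsed seconds."""
    start = time.perf_counter()
    for name in files:
        fake_ocr_job(name)
    return time.perf_counter() - start


def run_parallel(files: list[str], workers: int = 8) -> float:
    """Submit all jobs, then wait for them, and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Phase 1: submit every job. Phase 2: collect results as they finish.
        futures = [pool.submit(fake_ocr_job, name) for name in files]
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker
    return time.perf_counter() - start


files = [f"receipt_{i:02d}.pdf" for i in range(1, 9)]  # hypothetical names
sequential_s = run_sequential(files)
parallel_s = run_parallel(files)
print(f"sequential: {sequential_s:.2f}s  parallel: {parallel_s:.2f}s")
```

With real OCR jobs the speedup depends on how many requests the backend can process at once, so the simulated ratio here should not be read as a prediction for either engine.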