feat(ocr): Add docTR OCR engine with metrics infrastructure

Add docTR as primary OCR engine with 2-tier sequential processing,
OCR metrics tracking, and simplified engine selection.

Features:
- docTR OCR engine with light+medium preprocessing tiers
- doctr_plus mode with early exit optimization (~65% fast path)
- OCR metrics dashboard with per-engine statistics
- User OCR preference persistence
- Parallel worker pool for OCR processing
- Cross-validation for extraction quality

Engine options: tesseract, doctr, doctr_plus (recommended), paddleocr

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-02 05:37:16 +02:00
parent 74f7aefc26
commit 495790411f
75 changed files with 23349 additions and 1311 deletions


@@ -0,0 +1,220 @@
# OCR Memory Leak Solutions Research
**Date**: 2026-01-01
**Problem**: The OCR worker crashes after 10 receipts because of a memory leak in WSL2
**Environment**: WSL2 on Windows, 8 GB RAM allocated, docTR + PyTorch
---
## Problem Diagnosis
- **WSL RAM**: 7.5 GB (of the 8 GB configured)
- **WSL Swap**: 2 GB, fully used (is the 8 GB setting not being applied?)
- **Crash**: After ~10 PDFs processed with hybrid-doctr
- **Error**: "A process in the process pool was terminated abruptly"
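
To confirm that memory really grows from one receipt to the next, the worker's resident set size can be logged after each job. The following is a minimal sketch, assuming a Linux `/proc` filesystem (as in WSL2); `rss_bytes` and `track_rss` are hypothetical helper names, not part of the codebase, and the `__main__` section only simulates a leak.

```python
import os


def rss_bytes() -> int:
    """Current resident set size of this process, read from /proc (Linux/WSL2 only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field = resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def track_rss(jobs, handler):
    """Run handler(job) for each job; yield (job index, RSS growth in MiB since start)."""
    baseline = rss_bytes()
    for i, job in enumerate(jobs, start=1):
        handler(job)
        yield i, (rss_bytes() - baseline) / 2**20


if __name__ == "__main__":
    retained = []  # simulated leak: every job keeps about 5 MiB alive
    leaky = lambda _job: retained.append(b"x" * (5 * 2**20))
    for i, grew in track_rss(range(5), leaky):
        print(f"after job {i}: {grew:+.1f} MiB")
```

In the real worker, `handler` would be the OCR call; steady growth across receipts, rather than a plateau after the first few, points to a leak.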
---
## Solutions Found (ordered by priority)
### ✅ Level 1: Environment Variables (TRIED FIRST)
```python
# In ocr_worker_process.py or at application startup
import os
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
```
**Source**: [docTR Issue #1594](https://github.com/mindee/doctr/issues/1594)
---
### ✅ Level 2: max_tasks_per_child in ProcessPoolExecutor
```python
from concurrent.futures import ProcessPoolExecutor
# The worker restarts after every N tasks (max_tasks_per_child requires Python 3.11+)
executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
```
**Effect**: Memory is fully released when the worker restarts
**Cost**: +10-15s every 5 jobs (docTR model reload)
**Source**: [Python Multiprocessing Memory](https://www.pythontutorials.net/blog/memory-usage-keep-growing-with-python-s-multiprocessing-pool/)
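
The restart cost is easier to weigh when expressed per job. A small sketch of that arithmetic, using only the figures quoted above (a 10-15s reload every 5 tasks); the function name is hypothetical:

```python
def amortized_restart_cost(reload_seconds: float, tasks_per_child: int) -> float:
    """Average extra seconds per job when the worker reloads its model every N tasks."""
    if tasks_per_child <= 0:
        raise ValueError("tasks_per_child must be positive")
    return reload_seconds / tasks_per_child


# Reload times quoted above: 10s (best case) and 15s (worst case), every 5 tasks.
for reload_seconds in (10, 15):
    extra = amortized_restart_cost(reload_seconds, tasks_per_child=5)
    print(f"reload {reload_seconds}s / 5 tasks = {extra:.1f}s extra per job")
```

With these numbers the restart adds roughly 2-3s per job on average; a larger `max_tasks_per_child` lowers that figure but lets leaked memory accumulate for longer before it is released.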
---
### ✅ Level 3: Manual Memory Cleanup
```python
import gc

import torch

# doctr_engine: the docTR predictor, loaded once per worker process
def process_image(image):
    with torch.no_grad():  # Disable gradient tracking
        result = doctr_engine(image)
    # Explicit cleanup (del only drops this function's reference to the image)
    del image
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return result
```
---
### ✅ Level 4: WSL Memory Reclaim
**In .wslconfig** (`C:\Users\<username>\.wslconfig`):
```ini
[wsl2]
memory=8GB
processors=2
swap=8GB
[experimental]
autoMemoryReclaim=gradual
```
**Manual drop caches** (in WSL):
```bash
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
```
**After changing .wslconfig**:
```powershell
wsl --shutdown
```
**Source**: [Microsoft DevBlogs - Memory Reclaim](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
⚠️ **Warning**: `autoMemoryReclaim` can affect Docker!
---
### ✅ Level 5: Sequential Processing with Cleanup
Restructure `_process_hybrid_doctr()`:
```python
import gc

import torch

def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
    results = []
    for pass_type in ['light', 'medium', 'heavy']:
        # Preprocess
        if pass_type == 'light':
            processed = preprocessor.preprocess_light(image)
        elif pass_type == 'medium':
            processed = preprocessor.preprocess_medium(image)
        else:
            processed = preprocessor.preprocess_heavy(image)
        # OCR
        with torch.no_grad():
            ocr_result = _doctr_recognize(doctr_engine, processed)
        # Clean up IMMEDIATELY, before the next pass allocates its own image
        del processed
        gc.collect()
        if ocr_result:
            extraction = extractor.extract(ocr_result.text)
            results.append(extraction)
            # Early exit if the result is already good enough
            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
                break
    return _merge_extractions(results)
```
---
### ✅ Level 6: Alternatives to ProcessPoolExecutor
**Ray** (better suited to ML):
```python
import ray

@ray.remote
def process_ocr(image_bytes):
    # Processing...
    result = run_ocr(image_bytes)  # placeholder for the actual OCR call
    return result

# Ray manages memory better
result = ray.get(process_ocr.remote(image_bytes))
```
**Dask**:
```python
from dask import delayed

@delayed
def process_ocr(image_bytes):
    result = run_ocr(image_bytes)  # placeholder for the actual OCR call
    return result

# Nothing runs until .compute() is called on the delayed object
result = process_ocr(image_bytes).compute()
```
**Source**: [Managing Memory Issues](https://eyxibnib.biz.id/2024/07/05/managing-memory-issues-with-pythons-threadpoolexecutor-and-processpoolexecutor/)
---
### ✅ Level 7: Downscale Large Images
```python
import cv2

def preprocess_with_size_limit(image, max_size=2000):
    # Shrink the image so that its longest side is at most max_size pixels
    h, w = image.shape[:2]
    if max(h, w) > max_size:
        scale = max_size / max(h, w)
        image = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return image
```
---
## Full References
### docTR Memory Issues
- [Discussion #1422 - Memory Leak on inference](https://github.com/mindee/doctr/discussions/1422)
- [Issue #1594 - CPU memory usage increase](https://github.com/mindee/doctr/issues/1594)
### PyTorch Multiprocessing
- [PyTorch Forums - Memory leak with multiprocessing](https://discuss.pytorch.org/t/memory-leak-with-multiprocessing/209512)
- [Issue #44156 - CUDA memory leak](https://github.com/pytorch/pytorch/issues/44156)
- [Issue #13246 - DataLoader memory replication](https://github.com/pytorch/pytorch/issues/13246)
### Python ProcessPoolExecutor
- [CPython Issue #90943 - Exception memory leak](https://github.com/python/cpython/issues/90943)
- [NumPy Issue #12122 - Memory leak with ProcessPoolExecutor](https://github.com/numpy/numpy/issues/12122)
### WSL2 Memory
- [Microsoft - Memory Reclaim in WSL2](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
- [WSL Issue #4166 - Massive RAM consumption](https://github.com/microsoft/WSL/issues/4166)
- [Limiting Memory Usage in WSL2](https://www.aleksandrhovhannisyan.com/blog/limiting-memory-usage-in-wsl-2/)
---
## Implementation Plan
1. **Step 1**: Add the env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
2. **Step 2**: Add max_tasks_per_child=5 to the worker pool
3. **Step 3**: Test; if it still crashes, add explicit gc.collect()
4. **Step 4**: If it still crashes, restructure processing to run sequentially
5. **Step 5**: If it still crashes, increase WSL memory or use Ray
---
## Important Note
The swap setting in .wslconfig may not be applied correctly. After any change:
```powershell
wsl --shutdown
# Then reopen WSL
```
Check with `free -h` that the new settings are in effect.
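
The same check can be scripted, for example as a startup sanity check in the worker. This is a sketch, assuming Linux `/proc/meminfo` (available in WSL2); `parse_meminfo` and `gib` are hypothetical helper names:

```python
import os


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}; sized fields are in KiB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def gib(kib: int) -> float:
    """Convert KiB to GiB."""
    return kib / 2**20


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"MemTotal:  {gib(info['MemTotal']):.1f} GiB")
    print(f"SwapTotal: {gib(info['SwapTotal']):.1f} GiB")
```

If `SwapTotal` still reports about 2 GiB after `wsl --shutdown`, the `swap=8GB` line is not being picked up.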

docs/OCR_TEST_RESULTS.md Normal file

@@ -0,0 +1,106 @@
# OCR Test Results - docTR+ Engine
**Date:** 2026-01-02 | **Receipts:** 26 | **Test:** Sequential
## Summary Comparison
| Workers | Avg | Total | Mem Used | Mem Avail |
|---------|-----|-------|----------|-----------|
| 1 | 6.8s | 176s | 3.2GB | 4.1GB |
| 2 | 7.2s | 187s | 3.1GB | 4.1GB |
| 3 | 6.8s | 176s | 3.9GB | 3.3GB |
**Success Rate:** 80.8% (21/26) - same for all configs
**Note:** With sequential requests, 1 worker is as fast as 3 (176s total in both runs), because only one receipt is in flight at a time.
Extra workers help only when requests arrive in parallel.
## Detailed Results (1 Worker)
| # | Receipt | Time | Tier | Result | Notes |
|---|---------|------|------|--------|-------|
| 01 | abonament kineterra | 6.8s | T1 | ✓ | 97% |
| 02 | benzina 14 august | 6.0s | T1 | ✓ | 83% |
| 03 | benzina 27 octombrie | 5.9s | T1 | ✓ | 83% |
| 04 | igiena 11 octombrie | 7.7s | T1 | ✓ | 97% |
| 05 | igiena 14 dec five-holding | 11.5s | T1+T2 | ✗ | TOTAL ±1 |
| 06 | rechizite 12 dec pictus | 5.9s | T1 | ✓ | 97% |
| 07 | benzina 10 mai 2025 | 5.1s | T1 | ✓ | 83% |
| 08 | brick consumabil 604 50% | 4.8s | T1 | ✓ | 97% |
| 09 | benzina 13 septembrie | 4.9s | T1 | ✓ | 83% |
| 10 | brick consumabile 604 | 5.3s | T1 | ✓ | 97% |
| 11 | benzina 20 dec | 5.8s | T1 | ✓ | 79% |
| 12 | bon fiscal Dedeman | 5.7s | T1 | ✓ | 90% |
| 13 | factura Dedeman | 6.8s | T1 | ✓ | 97% |
| 14 | benzina 13 iulie | 5.7s | T1 | ✓ | 95% |
| 15 | best print stampila | 4.5s | T1 | ✓ | 94% |
| 16 | electrobering telecomanda | 4.8s | T1 | ✓ | 97% |
| 17 | brick igiena 8 oct | 11.9s | T1+T2 | ✗ | TOTAL/CUI |
| 18 | gama ink refill toner | 5.9s | T1 | ✓ | 94% |
| 19 | kineterra fizioterapie | 4.6s | T1 | ✓ | 97% |
| 20 | brick igiena 1 sept | 12.5s | T1+T2 | ✗ | ALL None |
| 21 | kineterra abonament | 5.6s | T1 | ✓ | 97% |
| 22 | brick igiena electrice | 15.9s | T1+T2 | ✗ | DATE None |
| 23 | electrobering igiena | 4.4s | T1 | ✓ | 97% |
| 24 | Lidl papetarie 604 | 5.8s | T1 | ✓ | 87% |
| 25 | brick igiena 604 | 6.8s | T1 | ✗ | DATE ±1 |
| 26 | unlimited duplicat | 4.8s | T1 | ✓ | 86% |
## Time Comparison by Receipt
| # | Receipt | 1W | 2W | 3W |
|---|---------|----|----|-----|
| 01 | abonament kineterra | 6.8s | 6.7s | 5.8s |
| 02 | benzina 14 august | 6.0s | 5.5s | 5.8s |
| 03 | benzina 27 octombrie | 5.9s | 5.9s | 5.7s |
| 04 | igiena 11 octombrie | 7.7s | 8.9s | 7.4s |
| 05 | igiena 14 dec (FAIL) | 11.5s | 12.3s | 11.9s |
| 06 | rechizite pictus | 5.9s | 5.9s | 5.7s |
| 07 | benzina 10 mai | 5.1s | 6.0s | 5.8s |
| 08 | brick 50% | 4.8s | 5.9s | 5.5s |
| 09 | benzina 13 sept | 4.9s | 5.9s | 5.3s |
| 10 | brick consumabile | 5.3s | 5.7s | 5.7s |
| 11 | benzina 20 dec | 5.8s | 5.4s | 5.8s |
| 12 | bon Dedeman | 5.7s | 5.9s | 5.8s |
| 13 | factura Dedeman | 6.8s | 6.9s | 6.8s |
| 14 | benzina 13 iulie | 5.7s | 6.1s | 5.4s |
| 15 | best print | 4.5s | 5.8s | 4.8s |
| 16 | electrobering | 4.8s | 4.2s | 4.7s |
| 17 | brick 8 oct (FAIL) | 11.9s | 13.1s | 12.0s |
| 18 | gama ink | 5.9s | 5.9s | 4.7s |
| 19 | kineterra fizioterapie | 4.6s | 5.9s | 4.8s |
| 20 | brick 1 sept (FAIL) | 12.5s | 13.2s | 13.1s |
| 21 | kineterra abonament | 5.6s | 4.9s | 4.8s |
| 22 | brick electrice (FAIL) | 15.9s | 17.0s | 15.5s |
| 23 | electrobering igiena | 4.4s | 5.4s | 5.0s |
| 24 | Lidl papetarie | 5.8s | 6.9s | 5.8s |
| 25 | brick 604 (FAIL) | 6.8s | 6.5s | 6.9s |
| 26 | unlimited duplicat | 4.8s | 5.8s | 5.0s |
| **AVG** | | **6.8s** | **7.2s** | **6.8s** |
| **TOTAL** | | **176s** | **187s** | **176s** |
## Tier Analysis
- **T1 only (early exit):** 22 receipts (4.4-7.7s); 21 succeeded, #25 failed
- **T1+T2 (full):** 4 receipts (11.5-15.9s); all 4 failed
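
The per-tier figures can be recomputed directly from the Detailed Results (1 Worker) table. A short sketch, with the 26 rows transcribed by hand as `(seconds, tier, success)`; the variable and function names are illustrative only:

```python
from collections import defaultdict

# (time in seconds, tier, success) for receipts 01-26, copied from the 1-worker table
ROWS = [
    (6.8, "T1", True), (6.0, "T1", True), (5.9, "T1", True), (7.7, "T1", True),
    (11.5, "T1+T2", False), (5.9, "T1", True), (5.1, "T1", True), (4.8, "T1", True),
    (4.9, "T1", True), (5.3, "T1", True), (5.8, "T1", True), (5.7, "T1", True),
    (6.8, "T1", True), (5.7, "T1", True), (4.5, "T1", True), (4.8, "T1", True),
    (11.9, "T1+T2", False), (5.9, "T1", True), (4.6, "T1", True), (12.5, "T1+T2", False),
    (5.6, "T1", True), (15.9, "T1+T2", False), (4.4, "T1", True), (5.8, "T1", True),
    (6.8, "T1", False), (4.8, "T1", True),
]


def tier_stats(rows):
    """Group rows by tier and return count, successes, min and max time per tier."""
    grouped = defaultdict(list)
    for seconds, tier, ok in rows:
        grouped[tier].append((seconds, ok))
    return {
        tier: {
            "count": len(items),
            "ok": sum(ok for _, ok in items),
            "min_s": min(s for s, _ in items),
            "max_s": max(s for s, _ in items),
        }
        for tier, items in grouped.items()
    }


for tier, stats in tier_stats(ROWS).items():
    print(tier, stats)
```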
## Failures (5)
| Receipt | Issue | Fixable |
|---------|-------|---------|
| igiena 14 dec | TOTAL ±1 | No |
| brick 8 oct | TOTAL/CUI | Maybe |
| brick 1 sept | ALL None | No (bad doc) |
| brick electrice | DATE None | Maybe |
| brick 604 | DATE ±1 | No |
## Recommendation
```
OCR_WORKERS=1 # Best for sequential, saves RAM
OCR_WORKERS=2 # For parallel requests (production)
OCR_MAX_TASKS_PER_CHILD=0 # No restart
```
**For 8GB RAM:** Use 1-2 workers max
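
One way these settings could be turned into a worker pool is sketched below. How the application actually reads `OCR_WORKERS` and `OCR_MAX_TASKS_PER_CHILD` is not shown here, so the defaults, the treatment of `0` as "never restart", and the helper names `pool_kwargs` and `make_pool` are all assumptions.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def pool_kwargs(env=os.environ) -> dict:
    """Translate the OCR_* settings into ProcessPoolExecutor keyword arguments."""
    workers = int(env.get("OCR_WORKERS", "1"))
    max_tasks = int(env.get("OCR_MAX_TASKS_PER_CHILD", "0"))
    kwargs = {"max_workers": workers}
    if max_tasks > 0:
        # Assumption: 0 means "no restart", so the argument is simply left out.
        # max_tasks_per_child is only accepted on Python 3.11+.
        kwargs["max_tasks_per_child"] = max_tasks
    return kwargs


def make_pool(env=os.environ) -> ProcessPoolExecutor:
    """Create the OCR worker pool from the environment."""
    return ProcessPoolExecutor(**pool_kwargs(env))


print(pool_kwargs({"OCR_WORKERS": "2", "OCR_MAX_TASKS_PER_CHILD": "0"}))
```

Keeping the parsing in `pool_kwargs` separate from pool creation makes the mapping easy to check without starting any worker processes.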

Binary file not shown.

File diff suppressed because it is too large Load Diff
