feat(ocr): Add docTR OCR engine with metrics infrastructure

Add docTR as primary OCR engine with 2-tier sequential processing, OCR metrics tracking, and simplified engine selection.

Features:
- docTR OCR engine with light+medium preprocessing tiers
- doctr_plus mode with early exit optimization (~65% fast path)
- OCR metrics dashboard with per-engine statistics
- User OCR preference persistence
- Parallel worker pool for OCR processing
- Cross-validation for extraction quality

Engine options: tesseract, doctr, doctr_plus (recommended), paddleocr

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-02 05:37:16 +02:00
parent 74f7aefc26
commit 495790411f
75 changed files with 23349 additions and 1311 deletions
--- a/docs/OCR_MEMORY_SOLUTIONS_RESEARCH.md
+++ b/docs/OCR_MEMORY_SOLUTIONS_RESEARCH.md
@@ -0,0 +1,220 @@
+# OCR Memory Leak Solutions Research
+
+**Date**: 2026-01-01
+**Problem**: The OCR worker crashes after 10 receipts because of a memory leak under WSL2
+**Environment**: WSL2 on Windows, 8 GB RAM allocated, docTR + PyTorch
+
+---
+
+## Problem Diagnosis
+
+- **WSL RAM**: 7.5 GB (of the 8 GB configured)
+- **WSL swap**: 2 GB, fully used (the configured 8 GB does not seem to apply?)
+- **Crash**: After ~10 PDFs processed with hybrid-doctr
+- **Error**: "A process in the process pool was terminated abruptly"
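+
+To confirm that memory really grows from one receipt to the next, peak resident memory can be logged after each job. The following is a minimal sketch, not part of the project: `peak_rss_mb` and `log_memory` are hypothetical helper names, and it relies on the standard-library `resource` module, which is available on Linux (including WSL2) but not on native Windows.
+
+```python
+import resource
+import sys
+
+
+def peak_rss_mb():
+    """Return this process's peak resident set size in megabytes."""
+    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
+    # ru_maxrss is reported in kilobytes on Linux (WSL2 included) and in bytes on macOS.
+    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
+    return peak / divisor
+
+
+def log_memory(job_index):
+    """Print peak RSS after a job so growth across jobs becomes visible."""
+    print(f"job {job_index}: peak RSS {peak_rss_mb():.1f} MB")
+```
+
+Calling `log_memory(i)` at the end of each job inside the worker should show a steadily rising value if the leak lives in the worker process. Peak RSS never decreases, so this reveals growth but cannot show memory being released.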
+
+---
+
+## Solutions Found (ordered by priority)
+
+### ✅ Level 1: Environment Variables (TRIED FIRST)
+
+```python
+# In ocr_worker_process.py or at application startup, before importing doctr or torch
+import os
+os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
+os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
+```
+
+**Source**: [docTR Issue #1594](https://github.com/mindee/doctr/issues/1594)
+
+---
+
+### ✅ Level 2: max_tasks_per_child in ProcessPoolExecutor
+
+```python
+from concurrent.futures import ProcessPoolExecutor
+
+# The worker restarts after every N tasks (max_tasks_per_child requires Python 3.11+)
+executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
+```
+
+**Effect**: Memory is fully released when the worker restarts
+**Cost**: +10-15 s every 5 jobs (docTR model reload)
+
+**Source**: [Python Multiprocessing Memory](https://www.pythontutorials.net/blog/memory-usage-keep-growing-with-python-s-multiprocessing-pool/)
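+
+The cost above can be turned into a per-job figure to help choose N: a 10-15 s reload spread over 5 tasks adds roughly 2-3 s per job. The sketch below is an illustration under assumptions, not the project's code: `amortized_reload_seconds` and `make_ocr_executor` are hypothetical names, and the version check is there only because `max_tasks_per_child` was added to `ProcessPoolExecutor` in Python 3.11.
+
+```python
+import sys
+from concurrent.futures import ProcessPoolExecutor
+
+
+def amortized_reload_seconds(reload_seconds, tasks_per_child):
+    """Average extra seconds per job caused by reloading the model on each worker restart."""
+    return reload_seconds / tasks_per_child
+
+
+def make_ocr_executor(tasks_per_child=5):
+    """Build a single-worker pool that recycles its worker where the interpreter supports it."""
+    kwargs = {"max_workers": 1}
+    if sys.version_info >= (3, 11):
+        # On older interpreters the pool is created without worker recycling.
+        kwargs["max_tasks_per_child"] = tasks_per_child
+    return ProcessPoolExecutor(**kwargs)
+```
+
+A larger N lowers the average overhead but lets the worker accumulate more leaked memory before it is replaced, so N should stay below the number of jobs that currently triggers the crash.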
+
+---
+
+### ✅ Level 3: Manual Memory Cleanup
+
+```python
+import gc
+import torch
+
+def process_image(image):
+    with torch.no_grad():  # Disable gradient tracking during inference
+        result = doctr_engine(image)
+
+    # Explicit cleanup: drop the local reference, then force a collection
+    del image
+    gc.collect()
+
+    if torch.cuda.is_available():
+        torch.cuda.empty_cache()
+
+    return result
+```
+
+---
+
+### ✅ Level 4: WSL Memory Reclaim
+
+**In .wslconfig** (`C:\Users\<username>\.wslconfig`):
+```ini
+[wsl2]
+memory=8GB
+processors=2
+swap=8GB
+
+[experimental]
+autoMemoryReclaim=gradual
+```
+
+**Manual drop caches** (inside WSL):
+```bash
+echo 1 | sudo tee /proc/sys/vm/drop_caches
+echo 1 | sudo tee /proc/sys/vm/compact_memory
+```
+
+**After modifying .wslconfig**:
+```powershell
+wsl --shutdown
+```
+
+**Source**: [Microsoft DevBlogs - Memory Reclaim](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
+
+⚠️ **Warning**: `autoMemoryReclaim` can affect Docker!
+
+---
+
+### ✅ Level 5: Sequential Processing with Cleanup
+
+Restructure `_process_hybrid_doctr()`:
+
+```python
+import gc
+import torch
+
+def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
+    results = []
+
+    for pass_type in ['light', 'medium', 'heavy']:
+        # Preprocess
+        if pass_type == 'light':
+            processed = preprocessor.preprocess_light(image)
+        elif pass_type == 'medium':
+            processed = preprocessor.preprocess_medium(image)
+        else:
+            processed = preprocessor.preprocess_heavy(image)
+
+        # OCR
+        with torch.no_grad():
+            ocr_result = _doctr_recognize(doctr_engine, processed)
+
+        # IMMEDIATE cleanup of the preprocessed image
+        del processed
+        gc.collect()
+
+        if ocr_result:
+            extraction = extractor.extract(ocr_result.text)
+            results.append(extraction)
+
+            # Early exit if the result is already good enough
+            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
+                break
+
+    return _merge_extractions(results)
+```
+
+---
+
+### ✅ Level 6: Alternatives to ProcessPoolExecutor
+
+**Ray** (better suited to ML workloads):
+```python
+import ray
+
+@ray.remote
+def process_ocr(image_bytes):
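+    # Placeholder: call the real OCR pipeline here (run_ocr is a hypothetical helper)
+    result = run_ocr(image_bytes)
+    return result
+
+# Ray runs each task in a separate worker process that it manages itself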
+result = ray.get(process_ocr.remote(image_bytes))
+```
+
+**Dask**:
+```python
+from dask import delayed
+
+@delayed
+def process_ocr(image_bytes):
+    # Placeholder: run_ocr is a hypothetical helper wrapping the OCR pipeline
+    return run_ocr(image_bytes)
+```
+
+**Source**: [Managing Memory Issues](https://eyxibnib.biz.id/2024/07/05/managing-memory-issues-with-pythons-threadpoolexecutor-and-processpoolexecutor/)
+
+---
+
+### ✅ Level 7: Downscale Large Images
+
+```python
+import cv2
+
+def preprocess_with_size_limit(image, max_size=2000):
+    h, w = image.shape[:2]
+    if max(h, w) > max_size:
+        scale = max_size / max(h, w)
+        image = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
+    return image
+```
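+
+The memory saved by downscaling can be estimated before touching any image. The sketch below is illustrative only: `downscaled_shape` and `rgb_megabytes` are hypothetical helpers, the size estimate assumes an uncompressed 3-channel, 8-bit image, and the rounding may differ by a pixel from what `cv2.resize` produces.
+
+```python
+def downscaled_shape(height, width, max_size=2000):
+    """Return the (height, width) an image would have after limiting its longest side."""
+    longest = max(height, width)
+    if longest <= max_size:
+        return height, width
+    scale = max_size / longest
+    return round(height * scale), round(width * scale)
+
+
+def rgb_megabytes(height, width):
+    """Approximate in-memory size of an 8-bit, 3-channel image in megabytes (10^6 bytes)."""
+    return height * width * 3 / 1e6
+```
+
+For a 4000x3000 scan, `downscaled_shape(4000, 3000)` gives `(2000, 1500)`, and the estimated buffer drops from 36 MB to 9 MB. Every preprocessing pass that copies the image benefits from the same factor of four.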
+
+---
+
+## Full References
+
+### docTR Memory Issues
+- [Discussion #1422 - Memory Leak on inference](https://github.com/mindee/doctr/discussions/1422)
+- [Issue #1594 - CPU memory usage increase](https://github.com/mindee/doctr/issues/1594)
+
+### PyTorch Multiprocessing
+- [PyTorch Forums - Memory leak with multiprocessing](https://discuss.pytorch.org/t/memory-leak-with-multiprocessing/209512)
+- [Issue #44156 - CUDA memory leak](https://github.com/pytorch/pytorch/issues/44156)
+- [Issue #13246 - DataLoader memory replication](https://github.com/pytorch/pytorch/issues/13246)
+
+### Python ProcessPoolExecutor
+- [CPython Issue #90943 - Exception memory leak](https://github.com/python/cpython/issues/90943)
+- [NumPy Issue #12122 - Memory leak with ProcessPoolExecutor](https://github.com/numpy/numpy/issues/12122)
+
+### WSL2 Memory
+- [Microsoft - Memory Reclaim in WSL2](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
+- [WSL Issue #4166 - Massive RAM consumption](https://github.com/microsoft/WSL/issues/4166)
+- [Limiting Memory Usage in WSL2](https://www.aleksandrhovhannisyan.com/blog/limiting-memory-usage-in-wsl-2/)
+
+---
+
+## Implementation Plan
+
+1. ✅ **Step 1**: Add env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
+2. ✅ **Step 2**: Add max_tasks_per_child=5 to the worker pool
+3. ⏳ **Step 3**: Test; if it still crashes, add explicit gc.collect()
+4. ⏳ **Step 4**: If it still crashes, restructure the processing to run sequentially
+5. ⏳ **Step 5**: If it still crashes, increase WSL memory or use Ray
+
+---
+
+## Important Note
+
+The swap setting in .wslconfig may not be applied correctly. After making changes:
+```powershell
+wsl --shutdown
+# Then reopen WSL
+```
+
+Check with `free -h` that the new settings have been applied.
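+
+The same check can be scripted from inside WSL by reading `/proc/meminfo`, which is where the totals shown by `free` come from. This is a minimal sketch with hypothetical helper names (`parse_meminfo`, `wsl_memory_gb`), assuming the usual `Key:   value kB` layout of that file.
+
+```python
+def parse_meminfo(text):
+    """Parse /proc/meminfo-style text into a dict mapping field names to integer values."""
+    values = {}
+    for line in text.splitlines():
+        key, _, rest = line.partition(":")
+        fields = rest.split()
+        # Keep only lines whose first field after the colon is a plain number.
+        if fields and fields[0].isdigit():
+            values[key.strip()] = int(fields[0])
+    return values
+
+
+def wsl_memory_gb(path="/proc/meminfo"):
+    """Return (total RAM, total swap) in GiB as currently seen by the Linux kernel."""
+    with open(path) as handle:
+        info = parse_meminfo(handle.read())
+    kb_per_gib = 1024 * 1024
+    return info["MemTotal"] / kb_per_gib, info["SwapTotal"] / kb_per_gib
+```
+
+If the swap value returned by `wsl_memory_gb()` is still around 2 after `wsl --shutdown`, the `swap=8GB` line in .wslconfig is not being picked up.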