OCR Memory Leak Solutions Research
Date: 2026-01-01
Problem: The OCR worker crashes after 10 receipts because of a memory leak under WSL2
Environment: WSL2 on Windows, 8 GB RAM allocated, docTR + PyTorch
Problem Diagnosis
- WSL RAM: 7.5 GB (of the 8 GB configured)
- WSL swap: 2 GB, fully used (the 8 GB setting is apparently not applied?)
- Crash: after ~10 PDFs processed with hybrid-doctr
- Error: "A process in the process pool was terminated abruptly"
Solutions Found (ordered by priority)
✅ Level 1: Environment Variables (TRIED FIRST)
# In ocr_worker_process.py or at application startup
import os
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
Source: docTR Issue #1594
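Native libraries may read their environment only once, when they are loaded, so the safe ordering is to set these variables before `torch` and `doctr` are imported. The sketch below illustrates that ordering. The helper name `configure_ocr_env` and the use of `setdefault` (so a value exported in the shell still wins) are assumptions made for illustration, not part of the original worker code.

```python
import os

# Values taken from the snippet above; the helper itself is hypothetical.
OCR_ENV_DEFAULTS = {
    "DOCTR_MULTIPROCESSING_DISABLE": "TRUE",
    "ONEDNN_PRIMITIVE_CACHE_CAPACITY": "1",
}


def configure_ocr_env(environ=os.environ):
    """Apply the defaults without overwriting values that are already set."""
    for name, value in OCR_ENV_DEFAULTS.items():
        environ.setdefault(name, value)
    # Return the effective values so they can be logged at startup.
    return {name: environ[name] for name in OCR_ENV_DEFAULTS}


# Call this at the very top of the worker module,
# before `import torch` or `import doctr`.
applied = configure_ocr_env()
```

Logging the returned dictionary at worker startup makes it easy to confirm that a restarted worker really picked up the settings.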
✅ Level 2: max_tasks_per_child in ProcessPoolExecutor
from concurrent.futures import ProcessPoolExecutor
# The worker is restarted after every N tasks (max_tasks_per_child requires Python 3.11+)
executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
Effect: Memory is fully released when the worker restarts
Cost: +10-15 s every 5 jobs (docTR model reload)
Source: Python Multiprocessing Memory
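To judge whether the reload cost is acceptable, it helps to put a number on it for a whole batch. The sketch below is a back-of-the-envelope estimate only. It assumes a single worker (max_workers=1, as above), that each worker restart costs one model reload in the 10-15 s range quoted above, and that the first model load is not counted because it happens with or without recycling. The function name is hypothetical.

```python
import math


def reload_overhead_seconds(n_jobs, tasks_per_child=5, reload_s=(10.0, 15.0)):
    """Estimate the extra time spent reloading the model due to worker recycling.

    Returns a (low, high) range in seconds. Only restarts are counted.
    """
    if n_jobs <= 0:
        return (0.0, 0.0)
    worker_starts = math.ceil(n_jobs / tasks_per_child)
    restarts = worker_starts - 1  # the first start is not an extra cost
    low, high = reload_s
    return (restarts * low, restarts * high)


low, high = reload_overhead_seconds(100)
print(f"100 jobs: {low:.0f}-{high:.0f} s of reload overhead")
```

Under these assumptions, 100 jobs mean 19 restarts, or 190 to 285 s in total, which is roughly 2 to 3 s per job. Raising tasks_per_child lowers that overhead but lets leaked memory accumulate longer between restarts.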
✅ Level 3: Manual Memory Cleanup
import gc
import torch

# doctr_engine: the docTR predictor, assumed to be loaded once per worker
def process_image(image):
    with torch.no_grad():  # Disable gradient tracking
        result = doctr_engine(image)
    # Explicit cleanup
    del image
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return result
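One gap in the function above is that the cleanup is skipped whenever the OCR call raises. A minimal sketch of the same idea as a decorator with try/finally follows. The decorator name is hypothetical, and the body of `process_image` here is a stand-in for the real docTR call. The CUDA cache is only cleared if `torch` has already been imported in the process, so the decorator itself does not force a torch import.

```python
import functools
import gc
import sys


def with_memory_cleanup(func):
    """Run func, then always collect garbage, even if func raised."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        finally:
            gc.collect()
            # Only touch CUDA if torch is already loaded in this process.
            torch = sys.modules.get("torch")
            if torch is not None and torch.cuda.is_available():
                torch.cuda.empty_cache()
    return wrapper


@with_memory_cleanup
def process_image(image):
    # Stand-in body: the real function would call the docTR predictor here.
    return len(image)
```

The decorator keeps the cleanup policy in one place, so every OCR entry point in the worker can share it.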
✅ Level 4: WSL Memory Reclaim
In .wslconfig (C:\Users\<username>\.wslconfig):
[wsl2]
memory=8GB
processors=2
swap=8GB
[experimental]
autoMemoryReclaim=gradual
Drop caches manually (inside WSL):
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
After changing .wslconfig:
wsl --shutdown
Source: Microsoft DevBlogs - Memory Reclaim
⚠️ Warning: autoMemoryReclaim can affect Docker!
✅ Level 5: Sequential Processing with Cleanup
Restructure _process_hybrid_doctr():
import gc
import torch

def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
    results = []
    for pass_type in ['light', 'medium', 'heavy']:
        # Preprocess
        if pass_type == 'light':
            processed = preprocessor.preprocess_light(image)
        elif pass_type == 'medium':
            processed = preprocessor.preprocess_medium(image)
        else:
            processed = preprocessor.preprocess_heavy(image)
        # OCR
        with torch.no_grad():
            ocr_result = _doctr_recognize(doctr_engine, processed)
        # Clean up IMMEDIATELY
        del processed
        gc.collect()
        if ocr_result:
            extraction = extractor.extract(ocr_result.text)
            results.append(extraction)
            # Early exit if the result is good enough
            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
                break
    return _merge_extractions(results)
✅ Level 6: Alternatives to ProcessPoolExecutor
Ray (better suited to ML):
import ray

@ray.remote
def process_ocr(image_bytes):
    # Processing...
    result = run_ocr(image_bytes)  # run_ocr: placeholder for the actual OCR call
    return result

# Ray manages memory better
result = ray.get(process_ocr.remote(image_bytes))
Dask:
from dask import delayed

@delayed
def process_ocr(image_bytes):
    result = run_ocr(image_bytes)  # run_ocr: placeholder for the actual OCR call
    return result
Source: Managing Memory Issues
✅ Level 7: Downscale Large Images
import cv2

def preprocess_with_size_limit(image, max_size=2000):
    h, w = image.shape[:2]
    # Shrink only when the longest side exceeds max_size; keep the aspect ratio
    if max(h, w) > max_size:
        scale = max_size / max(h, w)
        image = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return image
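The memory benefit comes from the fact that pixel count scales with the square of the scale factor. The sketch below works through that arithmetic in plain Python, without OpenCV. The function name is hypothetical, and the rounding may differ by a pixel from what cv2.resize produces.

```python
def target_size(height, width, max_size=2000):
    """Return (new_height, new_width, pixel_ratio) after capping the longest side."""
    longest = max(height, width)
    if longest <= max_size:
        return height, width, 1.0
    scale = max_size / longest
    new_h, new_w = round(height * scale), round(width * scale)
    # pixel_ratio: fraction of the original pixel count that remains
    return new_h, new_w, (new_h * new_w) / (height * width)


new_h, new_w, ratio = target_size(3000, 4000)
print(new_h, new_w, ratio)
```

A 3000 x 4000 scan is scaled by 0.5 to 1500 x 2000, which leaves a quarter of the original pixels for every later preprocessing pass to allocate.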
Full References
docTR Memory Issues
PyTorch Multiprocessing
- PyTorch Forums - Memory leak with multiprocessing
- Issue #44156 - CUDA memory leak
- Issue #13246 - DataLoader memory replication
Python ProcessPoolExecutor
- CPython Issue #90943 - Exception memory leak
- NumPy Issue #12122 - Memory leak with ProcessPoolExecutor
WSL2 Memory
- Microsoft - Memory Reclaim in WSL2
- WSL Issue #4166 - Massive RAM consumption
- Limiting Memory Usage in WSL2
Implementation Plan
- ✅ Step 1: Add the env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
- ✅ Step 2: Add max_tasks_per_child=5 to the worker pool
- ⏳ Step 3: Test; if it still crashes, add explicit gc.collect()
- ⏳ Step 4: If it still crashes, restructure the sequential processing
- ⏳ Step 5: If it still crashes, increase WSL memory or use Ray
Important Note
The swap setting in .wslconfig may not be applied correctly. After making changes:
wsl --shutdown
# Then reopen WSL
Verify with free -h that the new settings are applied.
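The same check can be scripted so that the worker logs the limits it actually sees at startup. The sketch below parses the text of /proc/meminfo, which is where free gets its numbers on Linux. The function names are hypothetical, and the sample values used in the usage note are made up to mirror the diagnosis above.

```python
def parse_meminfo(text):
    """Parse the content of /proc/meminfo into {key: value in KiB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def wsl_limits_gib(text):
    """Return (ram_gib, swap_gib) as seen by the Linux kernel inside WSL."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 * 1024
    ram = round(info.get("MemTotal", 0) / kib_per_gib, 1)
    swap = round(info.get("SwapTotal", 0) / kib_per_gib, 1)
    return ram, swap


def read_wsl_limits(path="/proc/meminfo"):
    # Reads the live values; only meaningful on Linux (including WSL2).
    with open(path) as fh:
        return wsl_limits_gib(fh.read())
```

Calling read_wsl_limits() inside WSL after wsl --shutdown shows whether the swap value from .wslconfig took effect. For a sample with MemTotal of 7864320 kB and SwapTotal of 2097152 kB, wsl_limits_gib reports 7.5 GiB of RAM and 2.0 GiB of swap.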