# OCR Memory Leak Solutions Research

**Date**: 2026-01-01
**Problem**: The OCR worker crashes after 10 receipts because of a memory leak in WSL2
**Environment**: WSL2 on Windows, 8GB RAM allocated, docTR + PyTorch

---

## Problem Diagnosis

- **WSL RAM**: 7.5 GB (of the 8GB configured)
- **WSL Swap**: 2 GB, fully used (the 8GB setting does not seem to be applied?)
- **Crash**: After ~10 PDFs processed with hybrid-doctr
- **Error**: "A process in the process pool was terminated abruptly"

---

## Solutions Found (ordered by priority)

### ✅ Level 1: Environment Variables (TRIED FIRST)

```python
# In ocr_worker_process.py, or at the very start of the application
import os

# Stop docTR from spawning its own multiprocessing/thread pool
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
# Keep the oneDNN primitive cache at a single entry so it cannot grow
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
```

**Source**: [docTR Issue #1594](https://github.com/mindee/doctr/issues/1594)

---

### ✅ Level 2: max_tasks_per_child in ProcessPoolExecutor

```python
from concurrent.futures import ProcessPoolExecutor

# The worker process is replaced after every N tasks.
# max_tasks_per_child requires Python 3.11+ and is incompatible
# with the "fork" start method.
executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
```

**Effect**: Memory is fully released when the worker restarts
**Cost**: +10-15s every 5 jobs (the docTR model is reloaded)
**Source**: [Python Multiprocessing Memory](https://www.pythontutorials.net/blog/memory-usage-keep-growing-with-python-s-multiprocessing-pool/)

---

### ✅ Level 3: Manual Memory Cleanup

```python
import gc

import torch


def process_image(image):
    # doctr_engine is the docTR predictor already loaded by the worker
    with torch.no_grad():  # Disable gradient tracking
        result = doctr_engine(image)

    # Explicit cleanup: drop the local reference, then force a collection
    del image
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return result
```

---

### ✅ Level 4: WSL Memory Reclaim

**In .wslconfig** (`C:\Users\\.wslconfig`):

```ini
[wsl2]
memory=8GB
processors=2
swap=8GB

[experimental]
autoMemoryReclaim=gradual
```

**Manual drop caches** (in WSL):

```bash
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
```

**After changing .wslconfig**:

```powershell
wsl --shutdown
```
**Source**: [Microsoft DevBlogs - Memory Reclaim](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)

⚠️ **Warning**: `autoMemoryReclaim` can affect Docker!

---

### ✅ Level 5: Sequential Processing with Cleanup

Restructuring `_process_hybrid_doctr()`:

```python
import gc

import torch


def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
    results = []

    for pass_type in ['light', 'medium', 'heavy']:
        # Preprocess
        if pass_type == 'light':
            processed = preprocessor.preprocess_light(image)
        elif pass_type == 'medium':
            processed = preprocessor.preprocess_medium(image)
        else:
            processed = preprocessor.preprocess_heavy(image)

        # OCR
        with torch.no_grad():
            ocr_result = _doctr_recognize(doctr_engine, processed)

        # Clean up IMMEDIATELY, before the next pass allocates its own image
        del processed
        gc.collect()

        if ocr_result:
            extraction = extractor.extract(ocr_result.text)
            results.append(extraction)

            # Early exit if the result is already good enough
            # (kept inside the branch so `extraction` is always defined)
            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
                break

    return _merge_extractions(results)
```

---

### ✅ Level 6: Alternatives to ProcessPoolExecutor

**Ray** (a better fit for ML workloads):

```python
import ray


@ray.remote
def process_ocr(image_bytes):
    # Processing...
    result = ...  # placeholder for the actual OCR call
    return result


# Per the source, Ray manages worker memory better
result = ray.get(process_ocr.remote(image_bytes))
```

**Dask**:

```python
from dask import delayed


@delayed
def process_ocr(image_bytes):
    result = ...  # placeholder for the actual OCR call
    return result
```

**Source**: [Managing Memory Issues](https://eyxibnib.biz.id/2024/07/05/managing-memory-issues-with-pythons-threadpoolexecutor-and-processpoolexecutor/)

---

### ✅ Level 7: Downscale Large Images

```python
import cv2


def preprocess_with_size_limit(image, max_size=2000):
    # Shrink the image so its longest side is at most max_size pixels
    h, w = image.shape[:2]
    if max(h, w) > max_size:
        scale = max_size / max(h, w)
        image = cv2.resize(image, None, fx=scale, fy=scale,
                           interpolation=cv2.INTER_AREA)
    return image
```

---

## Full References

### docTR Memory Issues
- [Discussion #1422 - Memory Leak on inference](https://github.com/mindee/doctr/discussions/1422)
- [Issue #1594 - CPU memory usage increase](https://github.com/mindee/doctr/issues/1594)

### PyTorch Multiprocessing
- [PyTorch Forums - Memory leak with multiprocessing](https://discuss.pytorch.org/t/memory-leak-with-multiprocessing/209512)
- [Issue #44156 - CUDA memory leak](https://github.com/pytorch/pytorch/issues/44156)
- [Issue #13246 - DataLoader memory replication](https://github.com/pytorch/pytorch/issues/13246)

### Python ProcessPoolExecutor
- [CPython Issue #90943 - Exception memory leak](https://github.com/python/cpython/issues/90943)
- [NumPy Issue #12122 - Memory leak with ProcessPoolExecutor](https://github.com/numpy/numpy/issues/12122)

### WSL2 Memory
- [Microsoft - Memory Reclaim in WSL2](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
- [WSL Issue #4166 - Massive RAM consumption](https://github.com/microsoft/WSL/issues/4166)
- [Limiting Memory Usage in WSL2](https://www.aleksandrhovhannisyan.com/blog/limiting-memory-usage-in-wsl-2/)

---

## Implementation Plan

1. ✅ **Step 1**: Add the env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
2. ✅ **Step 2**: Add max_tasks_per_child=5 to the worker pool
3. ⏳ **Step 3**: Test; if it still crashes, add explicit gc.collect() calls
4. ⏳ **Step 4**: If it still crashes, restructure processing to run sequentially
5. ⏳ **Step 5**: If it still crashes, increase WSL memory or switch to Ray

---

## Important Note

The swap setting in .wslconfig may not be applied correctly. After any change:

```powershell
wsl --shutdown
# Then reopen WSL
```

Check with `free -h` that the new settings have been applied.
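
The `free -h` check can also be scripted so it runs when the worker starts. Below is a minimal sketch, assuming the values are read from `/proc/meminfo` inside WSL. The helper names `parse_meminfo` and `check_wsl_limits`, and the 1 GiB tolerance, are illustrative assumptions and not part of the existing worker code. The tolerance is there because the RAM visible inside WSL is somewhat below the configured value (7.5 GB of 8GB in the diagnosis above).

```python
KIB_PER_GIB = 1024 * 1024


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Sized fields such as MemTotal and SwapTotal are reported in KiB.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def check_wsl_limits(text, expected_mem_gib, expected_swap_gib, tolerance_gib=1.0):
    """Compare MemTotal/SwapTotal with the values set in .wslconfig.

    tolerance_gib is a hypothetical default, not a documented threshold.
    """
    info = parse_meminfo(text)
    mem_gib = info["MemTotal"] / KIB_PER_GIB
    swap_gib = info["SwapTotal"] / KIB_PER_GIB
    return {
        "mem_gib": round(mem_gib, 2),
        "swap_gib": round(swap_gib, 2),
        "mem_ok": abs(mem_gib - expected_mem_gib) <= tolerance_gib,
        "swap_ok": abs(swap_gib - expected_swap_gib) <= tolerance_gib,
    }
```

Inside WSL it could be called as `check_wsl_limits(open("/proc/meminfo").read(), 8, 8)`. A `swap_ok` of `False` would point to the situation described in the diagnosis, where only 2 GB of swap is available despite `swap=8GB`.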