- Create docs/troubleshooting/ for debugging guides - Move OCR_MEMORY_SOLUTIONS_RESEARCH.md → troubleshooting/OCR_MEMORY_LEAKS.md - Delete outdated docs/data-entry/OCR_PROFILE_TEST_RESULTS.md - Update CLAUDE.md documentation index with troubleshooting reference Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# OCR Memory Leak Solutions Research

**Date**: 2026-01-01
**Problem**: The OCR worker crashes after 10 receipts because of a memory leak under WSL2
**Environment**: WSL2 on Windows, 8 GB RAM allocated, docTR + PyTorch

---

## Problem Diagnosis

- **WSL RAM**: 7.5 GB (of the 8 GB configured)
- **WSL Swap**: 2 GB, fully used (the 8 GB setting is apparently not applied?)
- **Crash**: After ~10 PDFs processed with hybrid-doctr
- **Error**: "A process in the process pool was terminated abruptly"
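
To confirm that the crash comes from steadily growing memory rather than from one oversized document, it can help to log the worker's resident memory after each job. The sketch below is a minimal, Linux-only illustration; `parse_vm_rss_kib` and `current_rss_mib` are hypothetical helper names, not part of the project.

```python
import re
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS field not found")
    return int(match.group(1))


def current_rss_mib() -> float:
    """Resident set size of the current process in MiB (Linux/WSL2 only)."""
    status_text = Path("/proc/self/status").read_text()
    return parse_vm_rss_kib(status_text) / 1024
```

Calling `current_rss_mib()` inside the worker after each processed PDF and logging the value should show whether memory climbs job after job until the process is killed.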

---

## Solutions Found (ordered by priority)

### ✅ Level 1: Environment Variables (TRIED FIRST)

```python
# In ocr_worker_process.py or at application startup
# (ideally before docTR and PyTorch are imported)
import os

os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
```

**Source**: [docTR Issue #1594](https://github.com/mindee/doctr/issues/1594)

---

### ✅ Level 2: max_tasks_per_child in ProcessPoolExecutor

```python
from concurrent.futures import ProcessPoolExecutor

# The worker restarts after every N tasks (requires Python 3.11+)
executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
```

**Effect**: Memory is fully released when the worker restarts
**Cost**: +10-15 s every 5 jobs (docTR model reload)

**Source**: [Python Multiprocessing Memory](https://www.pythontutorials.net/blog/memory-usage-keep-growing-with-python-s-multiprocessing-pool/)
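
The trade-off behind the recycle interval is simple arithmetic: the reload cost is paid once per batch of tasks, so the average overhead per job is the reload time divided by the number of tasks per child. A minimal sketch using the figures above (the function name is illustrative only):

```python
def amortized_reload_seconds(reload_seconds: float, tasks_per_child: int) -> float:
    """Average extra time per job caused by restarting the worker."""
    if tasks_per_child < 1:
        raise ValueError("tasks_per_child must be at least 1")
    return reload_seconds / tasks_per_child


# With the 10-15 s reload reported above and a restart every 5 jobs:
low = amortized_reload_seconds(10, 5)   # 2.0 s per job
high = amortized_reload_seconds(15, 5)  # 3.0 s per job
print(f"Overhead per job: {low:.1f}-{high:.1f} s")
```

A larger interval lowers the per-job overhead but lets more leaked memory accumulate between restarts, so the value has to stay below the point where the worker crashes.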

---

### ✅ Level 3: Manual Memory Cleanup

```python
import gc

import torch


def process_image(image):
    # doctr_engine is assumed to be a module-level docTR predictor
    with torch.no_grad():  # Disable gradient tracking
        result = doctr_engine(image)

    # Explicit cleanup
    del image
    gc.collect()

    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return result
```

---

### ✅ Level 4: WSL Memory Reclaim

**In .wslconfig** (`C:\Users\<username>\.wslconfig`):
```ini
[wsl2]
memory=8GB
processors=2
swap=8GB

[experimental]
autoMemoryReclaim=gradual
```

**Manual cache drop** (inside WSL):
```bash
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
```

**After changing .wslconfig**:
```powershell
wsl --shutdown
```

**Source**: [Microsoft DevBlogs - Memory Reclaim](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)

⚠️ **Warning**: `autoMemoryReclaim` can affect Docker!

---

### ✅ Level 5: Sequential Processing with Cleanup

Restructuring `_process_hybrid_doctr()`:

```python
import gc

import torch


def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
    results = []

    for pass_type in ['light', 'medium', 'heavy']:
        # Preprocess
        if pass_type == 'light':
            processed = preprocessor.preprocess_light(image)
        elif pass_type == 'medium':
            processed = preprocessor.preprocess_medium(image)
        else:
            processed = preprocessor.preprocess_heavy(image)

        # OCR
        with torch.no_grad():
            ocr_result = _doctr_recognize(doctr_engine, processed)

        # Clean up IMMEDIATELY
        del processed
        gc.collect()

        if ocr_result:
            extraction = extractor.extract(ocr_result.text)
            results.append(extraction)

            # Early exit if the result is good enough
            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
                break

    return _merge_extractions(results)
```

---

### ✅ Level 6: Alternatives to ProcessPoolExecutor

**Ray** (better suited to ML):
```python
import ray


@ray.remote
def process_ocr(image_bytes):
    # Processing... (run_ocr is a placeholder for the actual OCR call)
    result = run_ocr(image_bytes)
    return result


# Ray manages memory better
result = ray.get(process_ocr.remote(image_bytes))
```

**Dask**:
```python
from dask import delayed


@delayed
def process_ocr(image_bytes):
    result = run_ocr(image_bytes)  # placeholder for the actual OCR call
    return result
```

**Source**: [Managing Memory Issues](https://eyxibnib.biz.id/2024/07/05/managing-memory-issues-with-pythons-threadpoolexecutor-and-processpoolexecutor/)

---

### ✅ Level 7: Downscale Large Images

```python
import cv2


def preprocess_with_size_limit(image, max_size=2000):
    h, w = image.shape[:2]
    if max(h, w) > max_size:
        scale = max_size / max(h, w)
        image = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return image
```
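
As a worked example of the scaling arithmetic, the sketch below computes the resulting dimensions without needing OpenCV. `scaled_dimensions` is a hypothetical helper introduced only for illustration; `cv2.resize` applies its own rounding, so the real output could differ by a pixel.

```python
def scaled_dimensions(height: int, width: int, max_size: int = 2000) -> tuple[int, int]:
    """Return (height, width) after limiting the longest side to max_size."""
    longest = max(height, width)
    if longest <= max_size:
        return height, width  # already small enough, no resize needed
    scale = max_size / longest
    return round(height * scale), round(width * scale)


# A 300 DPI A4 scan is roughly 3508 x 2480 pixels.
print(scaled_dimensions(3508, 2480))  # (2000, 1414)
print(scaled_dimensions(1200, 800))   # (1200, 800)
```

For the A4 scan this cuts the pixel count to about a third, which reduces the size of every intermediate buffer in the preprocessing and OCR passes accordingly.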

---

## Full References

### docTR Memory Issues
- [Discussion #1422 - Memory Leak on inference](https://github.com/mindee/doctr/discussions/1422)
- [Issue #1594 - CPU memory usage increase](https://github.com/mindee/doctr/issues/1594)

### PyTorch Multiprocessing
- [PyTorch Forums - Memory leak with multiprocessing](https://discuss.pytorch.org/t/memory-leak-with-multiprocessing/209512)
- [Issue #44156 - CUDA memory leak](https://github.com/pytorch/pytorch/issues/44156)
- [Issue #13246 - DataLoader memory replication](https://github.com/pytorch/pytorch/issues/13246)

### Python ProcessPoolExecutor
- [CPython Issue #90943 - Exception memory leak](https://github.com/python/cpython/issues/90943)
- [NumPy Issue #12122 - Memory leak with ProcessPoolExecutor](https://github.com/numpy/numpy/issues/12122)

### WSL2 Memory
- [Microsoft - Memory Reclaim in WSL2](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
- [WSL Issue #4166 - Massive RAM consumption](https://github.com/microsoft/WSL/issues/4166)
- [Limiting Memory Usage in WSL2](https://www.aleksandrhovhannisyan.com/blog/limiting-memory-usage-in-wsl-2/)

---

## Implementation Plan

1. ✅ **Step 1**: Add the env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
2. ✅ **Step 2**: Add max_tasks_per_child=5 to the worker pool
3. ⏳ **Step 3**: Test; if it still crashes, add explicit gc.collect()
4. ⏳ **Step 4**: If it still crashes, restructure the processing to run sequentially
5. ⏳ **Step 5**: If it still crashes, increase the WSL memory or use Ray

---

## Important Note

The swap setting in .wslconfig may not be applied correctly. After making changes:
```powershell
wsl --shutdown
# Then reopen WSL
```

Check with `free -h` that the new settings have been applied.
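
The same check can be scripted. The sketch below parses `/proc/meminfo` (the file `free` reads from) and reports total RAM and swap in GiB; `parse_meminfo_gib` and `read_memory_totals` are hypothetical helper names.

```python
from pathlib import Path


def parse_meminfo_gib(meminfo_text: str) -> dict[str, float]:
    """Return MemTotal and SwapTotal from /proc/meminfo content, in GiB."""
    wanted = {"MemTotal", "SwapTotal"}
    totals = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            kib = int(rest.split()[0])  # values are reported in kB (KiB)
            totals[key] = kib / (1024 * 1024)
    return totals


def read_memory_totals() -> dict[str, float]:
    """Read the live values inside WSL2 (Linux only)."""
    return parse_meminfo_gib(Path("/proc/meminfo").read_text())
```

If `SwapTotal` still reports about 2 GiB after `wsl --shutdown`, the `swap=8GB` setting is probably not being picked up.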