feat(ocr): Add docTR OCR engine with metrics infrastructure
Add docTR as primary OCR engine with 2-tier sequential processing, OCR metrics tracking, and simplified engine selection.

Features:
- docTR OCR engine with light+medium preprocessing tiers
- doctr_plus mode with early exit optimization (~65% fast path)
- OCR metrics dashboard with per-engine statistics
- User OCR preference persistence
- Parallel worker pool for OCR processing
- Cross-validation for extraction quality

Engine options: tesseract, doctr, doctr_plus (recommended), paddleocr

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
220
docs/OCR_MEMORY_SOLUTIONS_RESEARCH.md
Normal file
@@ -0,0 +1,220 @@
# OCR Memory Leak Solutions Research

**Date**: 2026-01-01
**Problem**: The OCR worker crashes after 10 receipts because of a memory leak under WSL2
**Environment**: WSL2 on Windows, 8GB RAM allocated, docTR + PyTorch

---

## Problem Diagnosis

- **WSL RAM**: 7.5 GB (of the 8GB configured)
- **WSL Swap**: 2 GB, fully used (the 8GB setting is apparently not applied?)
- **Crash**: After ~10 PDFs processed with hybrid-doctr
- **Error**: "A process in the process pool was terminated abruptly"

---

## Solutions Found (ordered by priority)

### ✅ Level 1: Environment Variables (TRIED FIRST)

```python
# In ocr_worker_process.py or at application startup,
# ideally before docTR / PyTorch are imported
import os

# Turn off docTR's internal multiprocessing
os.environ["DOCTR_MULTIPROCESSING_DISABLE"] = "TRUE"
# Limit the oneDNN primitive cache to a single entry
os.environ["ONEDNN_PRIMITIVE_CACHE_CAPACITY"] = "1"
```

**Source**: [docTR Issue #1594](https://github.com/mindee/doctr/issues/1594)
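
One way to apply these settings without overwriting values an operator has already exported is `setdefault`. The sketch below is an assumption about how the worker could be wired, not the project's actual startup code: `apply_ocr_env_defaults` and `OCR_ENV_DEFAULTS` are hypothetical names, and only the two variable names and values come from the snippet above.

```python
import os

# Variable names and values are the ones listed above; the helper itself is hypothetical.
OCR_ENV_DEFAULTS = {
    "DOCTR_MULTIPROCESSING_DISABLE": "TRUE",
    "ONEDNN_PRIMITIVE_CACHE_CAPACITY": "1",
}


def apply_ocr_env_defaults(env=None):
    """Set each variable only if it is not already defined; return the effective values."""
    env = os.environ if env is None else env
    for name, value in OCR_ENV_DEFAULTS.items():
        env.setdefault(name, value)
    return {name: env[name] for name in OCR_ENV_DEFAULTS}


# The safe ordering is to call this before importing doctr or torch.
effective = apply_ocr_env_defaults()
```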

---

### ✅ Level 2: maxtasksperchild in ProcessPool

```python
from concurrent.futures import ProcessPoolExecutor

# The worker process is replaced after every N tasks.
# max_tasks_per_child requires Python 3.11+
# (multiprocessing.Pool calls the same option maxtasksperchild).
executor = ProcessPoolExecutor(max_workers=1, max_tasks_per_child=5)
```

**Effect**: Memory is fully released when the worker restarts
**Cost**: +10-15s every 5 jobs (the docTR model is reloaded)

**Source**: [Python Multiprocessing Memory](https://www.pythontutorials.net/blog/memory-usage-keep-growing-with-python-s-multiprocessing-pool/)
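
The cost figure above can be turned into a per-job overhead, which helps when choosing the restart interval: a longer interval lowers the overhead but lets leaked memory accumulate for longer. The sketch is plain arithmetic on the numbers quoted in this section (10-15s reload, restart every 5 jobs); the function name is illustrative.

```python
def restart_overhead_per_job(reload_seconds, tasks_per_child):
    """Average extra seconds per job when the model is reloaded every tasks_per_child jobs."""
    if tasks_per_child <= 0:
        raise ValueError("tasks_per_child must be positive")
    return reload_seconds / tasks_per_child


# With the figures quoted above: a 10-15s reload every 5 jobs
low = restart_overhead_per_job(10, 5)    # 2.0 seconds per job
high = restart_overhead_per_job(15, 5)   # 3.0 seconds per job
print(f"amortized overhead: {low:.1f}-{high:.1f}s per job")
```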

---

### ✅ Level 3: Manual Memory Cleanup

```python
import gc
import torch

# doctr_engine: the docTR predictor, assumed to be loaded once per worker

def process_image(image):
    with torch.no_grad():  # Disable gradient tracking
        result = doctr_engine(image)

    # Explicit cleanup. `del` only drops this function's reference;
    # the caller must release its own reference as well.
    del image
    gc.collect()

    if torch.cuda.is_available():
        torch.cuda.empty_cache()

    return result
```
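
To judge whether the cleanup has any effect, it helps to log the worker's resident memory before and after each task. The sketch below is a generic Linux measurement helper, not part of docTR or of this project; `current_rss_mb` and `log_memory_delta` are hypothetical names. It reads `/proc/self/statm`, which is available inside WSL2, and falls back to the peak RSS reported by the `resource` module elsewhere.

```python
import os
import resource


def current_rss_mb():
    """Resident set size of this process in MiB."""
    try:
        # /proc/self/statm: the second field is the number of resident pages
        with open("/proc/self/statm") as fh:
            resident_pages = int(fh.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)
    except (OSError, IndexError, ValueError):
        # Fallback: peak RSS (kilobytes on Linux), which never decreases
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def log_memory_delta(label, func, *args, **kwargs):
    """Run func and print how much the RSS changed across the call."""
    before = current_rss_mb()
    result = func(*args, **kwargs)
    after = current_rss_mb()
    print(f"{label}: {before:.1f} MiB -> {after:.1f} MiB ({after - before:+.1f})")
    return result


# Stand-in workload; in the worker this would wrap process_image(image)
log_memory_delta("allocate 1M ints", lambda: list(range(1_000_000)))
```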

---

### ✅ Level 4: WSL Memory Reclaim

**In .wslconfig** (`C:\Users\<username>\.wslconfig`):
```ini
[wsl2]
memory=8GB
processors=2
swap=8GB

[experimental]
autoMemoryReclaim=gradual
```

**Manually drop caches** (inside WSL):
```bash
echo 1 | sudo tee /proc/sys/vm/drop_caches
echo 1 | sudo tee /proc/sys/vm/compact_memory
```

**After changing .wslconfig**:
```powershell
wsl --shutdown
```

**Source**: [Microsoft DevBlogs - Memory Reclaim](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)

⚠️ **Warning**: `autoMemoryReclaim` can affect Docker!

---

### ✅ Level 5: Sequential Processing with Cleanup

Restructuring of `_process_hybrid_doctr()`:

```python
import gc
import torch

def _process_hybrid_doctr_memory_efficient(image, doctr_engine, preprocessor, extractor):
    results = []

    for pass_type in ['light', 'medium', 'heavy']:
        # Preprocess
        if pass_type == 'light':
            processed = preprocessor.preprocess_light(image)
        elif pass_type == 'medium':
            processed = preprocessor.preprocess_medium(image)
        else:
            processed = preprocessor.preprocess_heavy(image)

        # OCR
        with torch.no_grad():
            ocr_result = _doctr_recognize(doctr_engine, processed)

        # IMMEDIATE cleanup of the preprocessed copy
        del processed
        gc.collect()

        if ocr_result:
            extraction = extractor.extract(ocr_result.text)
            results.append(extraction)

            # Early exit if the result is already good enough
            if _is_extraction_complete(extraction) and extraction.overall_confidence > 0.9:
                break

    return _merge_extractions(results)
```
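
The helpers `_is_extraction_complete` and `_merge_extractions` are not shown in this document. The sketch below is one possible shape for them, assuming an extraction carries `total`, `date` and `cui` fields next to `overall_confidence`. Only `overall_confidence` appears in the code above; the other field names, the `Extraction` class and the sample values are hypothetical and not the project's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Extraction:
    # Hypothetical fields; only overall_confidence appears in the code above
    total: Optional[float] = None
    date: Optional[str] = None
    cui: Optional[str] = None
    overall_confidence: float = 0.0


FIELDS = ("total", "date", "cui")


def is_extraction_complete(extraction: Extraction) -> bool:
    """True when every required field has a value."""
    return all(getattr(extraction, name) is not None for name in FIELDS)


def merge_extractions(results: List[Extraction]) -> Optional[Extraction]:
    """For each field, take the value from the most confident pass that has one."""
    if not results:
        return None
    ranked = sorted(results, key=lambda r: r.overall_confidence, reverse=True)
    merged = Extraction(overall_confidence=ranked[0].overall_confidence)
    for name in FIELDS:
        for candidate in ranked:
            value = getattr(candidate, name)
            if value is not None:
                setattr(merged, name, value)
                break
    return merged


# Made-up sample passes: each one misses a different field
light = Extraction(total=98.95, date=None, cui="12345", overall_confidence=0.7)
medium = Extraction(total=98.95, date="2025-10-08", cui=None, overall_confidence=0.6)
merged = merge_extractions([light, medium])
print(is_extraction_complete(light), is_extraction_complete(merged))
```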

---

### ✅ Level 6: Alternatives to ProcessPoolExecutor

**Ray** (better suited to ML workloads):
```python
import ray

@ray.remote
def process_ocr(image_bytes):
    # run_ocr is a placeholder for the actual OCR call
    return run_ocr(image_bytes)

# Ray reportedly handles worker memory better than ProcessPoolExecutor
result = ray.get(process_ocr.remote(image_bytes))
```

**Dask**:
```python
from dask import delayed

@delayed
def process_ocr(image_bytes):
    # run_ocr is a placeholder for the actual OCR call
    return run_ocr(image_bytes)

# Nothing runs until .compute() is called
result = process_ocr(image_bytes).compute()
```

**Source**: [Managing Memory Issues](https://eyxibnib.biz.id/2024/07/05/managing-memory-issues-with-pythons-threadpoolexecutor-and-processpoolexecutor/)

---

### ✅ Level 7: Downscale Large Images

```python
import cv2

def preprocess_with_size_limit(image, max_size=2000):
    """Downscale so that the longest side is at most max_size pixels."""
    h, w = image.shape[:2]
    if max(h, w) > max_size:
        scale = max_size / max(h, w)
        # INTER_AREA is the usual choice when shrinking an image
        image = cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
    return image
```
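
The memory effect of the size limit is easy to estimate, because the pixel count scales with the square of the resize factor. The sketch below repeats the scale computation from `preprocess_with_size_limit` in pure Python, without OpenCV. `scaled_shape` is an illustrative name, and the 4000x3000 input is an example value, not a measurement from this project.

```python
def scaled_shape(h, w, max_size=2000):
    """Return (new_h, new_w, scale) using the same rule as preprocess_with_size_limit."""
    longest = max(h, w)
    if longest <= max_size:
        return h, w, 1.0
    scale = max_size / longest
    # cv2.resize derives the output size by rounding fx * width and fy * height;
    # round() approximates that here
    return round(h * scale), round(w * scale), scale


new_h, new_w, scale = scaled_shape(3000, 4000)
pixel_ratio = (new_h * new_w) / (3000 * 4000)
print(new_h, new_w, scale, pixel_ratio)  # 1500 2000 0.5 0.25
```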

---

## Full References

### docTR Memory Issues
- [Discussion #1422 - Memory Leak on inference](https://github.com/mindee/doctr/discussions/1422)
- [Issue #1594 - CPU memory usage increase](https://github.com/mindee/doctr/issues/1594)

### PyTorch Multiprocessing
- [PyTorch Forums - Memory leak with multiprocessing](https://discuss.pytorch.org/t/memory-leak-with-multiprocessing/209512)
- [Issue #44156 - CUDA memory leak](https://github.com/pytorch/pytorch/issues/44156)
- [Issue #13246 - DataLoader memory replication](https://github.com/pytorch/pytorch/issues/13246)

### Python ProcessPoolExecutor
- [CPython Issue #90943 - Exception memory leak](https://github.com/python/cpython/issues/90943)
- [NumPy Issue #12122 - Memory leak with ProcessPoolExecutor](https://github.com/numpy/numpy/issues/12122)

### WSL2 Memory
- [Microsoft - Memory Reclaim in WSL2](https://devblogs.microsoft.com/commandline/memory-reclaim-in-the-windows-subsystem-for-linux-2/)
- [WSL Issue #4166 - Massive RAM consumption](https://github.com/microsoft/WSL/issues/4166)
- [Limiting Memory Usage in WSL2](https://www.aleksandrhovhannisyan.com/blog/limiting-memory-usage-in-wsl-2/)

---

## Implementation Plan

1. ✅ **Step 1**: Add the env vars (DOCTR_MULTIPROCESSING_DISABLE, ONEDNN_PRIMITIVE_CACHE_CAPACITY)
2. ✅ **Step 2**: Add maxtasksperchild=5 to the worker pool
3. ⏳ **Step 3**: Test; if it still crashes, add an explicit gc.collect()
4. ⏳ **Step 4**: If it still crashes, restructure the sequential processing
5. ⏳ **Step 5**: If it still crashes, increase WSL memory or use Ray

---

## Important Note

The swap setting in .wslconfig may not be applied correctly. After making changes:
```powershell
wsl --shutdown
# Then reopen WSL
```

Verify with `free -h` that the new settings are in effect.
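
The same check can be scripted, for example at worker startup, by reading `/proc/meminfo`, which is where `free` gets its numbers. This is a sketch: `parse_meminfo` is a hypothetical helper, and the sample text only imitates the file's format with made-up values.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: size in GiB} for kB-valued lines."""
    sizes = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB":
            sizes[name.strip()] = int(parts[0]) / (1024 * 1024)
    return sizes


# Made-up sample in the /proc/meminfo format; inside WSL, read the real file instead:
#   with open("/proc/meminfo") as fh: text = fh.read()
sample = """MemTotal:        8388608 kB
MemAvailable:    4194304 kB
SwapTotal:       2097152 kB
HugePages_Total:       0
"""
info = parse_meminfo(sample)
print(f"RAM {info['MemTotal']:.1f} GiB, swap {info['SwapTotal']:.1f} GiB")
```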
106
docs/OCR_TEST_RESULTS.md
Normal file
@@ -0,0 +1,106 @@
# OCR Test Results - docTR+ Engine

**Date:** 2026-01-02 | **Receipts:** 26 | **Test:** Sequential

## Summary Comparison

| Workers | Avg per receipt | Total | Mem Used | Mem Avail |
|---------|-----------------|-------|----------|-----------|
| 1 | 6.8s | 176s | 3.2GB | 4.1GB |
| 2 | 7.2s | 187s | 3.1GB | 4.1GB |
| 3 | 6.8s | 176s | 3.9GB | 3.3GB |

**Success Rate:** 80.8% (21/26) - same for all configs

**Note:** In the sequential test, 1 worker is as fast as 3 workers (6.8s average, 176s total in both runs).
Multiple workers only help with parallel requests.

## Detailed Results (1 Worker)

| # | Receipt | Time | Tier | Result | Notes |
|---|---------|------|------|--------|-------|
| 01 | abonament kineterra | 6.8s | T1 | ✓ | 97% |
| 02 | benzina 14 august | 6.0s | T1 | ✓ | 83% |
| 03 | benzina 27 octombrie | 5.9s | T1 | ✓ | 83% |
| 04 | igiena 11 octombrie | 7.7s | T1 | ✓ | 97% |
| 05 | igiena 14 dec five-holding | 11.5s | T1+T2 | ✗ | TOTAL ±1 |
| 06 | rechizite 12 dec pictus | 5.9s | T1 | ✓ | 97% |
| 07 | benzina 10 mai 2025 | 5.1s | T1 | ✓ | 83% |
| 08 | brick consumabil 604 50% | 4.8s | T1 | ✓ | 97% |
| 09 | benzina 13 septembrie | 4.9s | T1 | ✓ | 83% |
| 10 | brick consumabile 604 | 5.3s | T1 | ✓ | 97% |
| 11 | benzina 20 dec | 5.8s | T1 | ✓ | 79% |
| 12 | bon fiscal Dedeman | 5.7s | T1 | ✓ | 90% |
| 13 | factura Dedeman | 6.8s | T1 | ✓ | 97% |
| 14 | benzina 13 iulie | 5.7s | T1 | ✓ | 95% |
| 15 | best print stampila | 4.5s | T1 | ✓ | 94% |
| 16 | electrobering telecomanda | 4.8s | T1 | ✓ | 97% |
| 17 | brick igiena 8 oct | 11.9s | T1+T2 | ✗ | TOTAL/CUI |
| 18 | gama ink refill toner | 5.9s | T1 | ✓ | 94% |
| 19 | kineterra fizioterapie | 4.6s | T1 | ✓ | 97% |
| 20 | brick igiena 1 sept | 12.5s | T1+T2 | ✗ | ALL None |
| 21 | kineterra abonament | 5.6s | T1 | ✓ | 97% |
| 22 | brick igiena electrice | 15.9s | T1+T2 | ✗ | DATE None |
| 23 | electrobering igiena | 4.4s | T1 | ✓ | 97% |
| 24 | Lidl papetarie 604 | 5.8s | T1 | ✓ | 87% |
| 25 | brick igiena 604 | 6.8s | T1 | ✗ | DATE ±1 |
| 26 | unlimited duplicat | 4.8s | T1 | ✓ | 86% |

## Time Comparison by Receipt

| # | Receipt | 1W | 2W | 3W |
|---|---------|----|----|-----|
| 01 | abonament kineterra | 6.8s | 6.7s | 5.8s |
| 02 | benzina 14 august | 6.0s | 5.5s | 5.8s |
| 03 | benzina 27 octombrie | 5.9s | 5.9s | 5.7s |
| 04 | igiena 11 octombrie | 7.7s | 8.9s | 7.4s |
| 05 | igiena 14 dec (FAIL) | 11.5s | 12.3s | 11.9s |
| 06 | rechizite pictus | 5.9s | 5.9s | 5.7s |
| 07 | benzina 10 mai | 5.1s | 6.0s | 5.8s |
| 08 | brick 50% | 4.8s | 5.9s | 5.5s |
| 09 | benzina 13 sept | 4.9s | 5.9s | 5.3s |
| 10 | brick consumabile | 5.3s | 5.7s | 5.7s |
| 11 | benzina 20 dec | 5.8s | 5.4s | 5.8s |
| 12 | bon Dedeman | 5.7s | 5.9s | 5.8s |
| 13 | factura Dedeman | 6.8s | 6.9s | 6.8s |
| 14 | benzina 13 iulie | 5.7s | 6.1s | 5.4s |
| 15 | best print | 4.5s | 5.8s | 4.8s |
| 16 | electrobering | 4.8s | 4.2s | 4.7s |
| 17 | brick 8 oct (FAIL) | 11.9s | 13.1s | 12.0s |
| 18 | gama ink | 5.9s | 5.9s | 4.7s |
| 19 | kineterra fizioterapie | 4.6s | 5.9s | 4.8s |
| 20 | brick 1 sept (FAIL) | 12.5s | 13.2s | 13.1s |
| 21 | kineterra abonament | 5.6s | 4.9s | 4.8s |
| 22 | brick electrice (FAIL) | 15.9s | 17.0s | 15.5s |
| 23 | electrobering igiena | 4.4s | 5.4s | 5.0s |
| 24 | Lidl papetarie | 5.8s | 6.9s | 5.8s |
| 25 | brick 604 (FAIL) | 6.8s | 6.5s | 6.9s |
| 26 | unlimited duplicat | 4.8s | 5.8s | 5.0s |
| **AVG** | | **6.8s** | **7.2s** | **6.8s** |
| **TOTAL** | | **176s** | **187s** | **176s** |

## Tier Analysis

- **T1 only (early exit):** 22 receipts (~4.4-7.7s), including one failure (#25)
- **T1+T2 (full):** 4 receipts (~11.5-16s), all failures
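
These figures can be recomputed from the Detailed Results table. The sketch below hard-codes the 26 rows (time in seconds, tier, pass/fail) as listed there and groups them strictly by the Tier column; the variable and function names are illustrative.

```python
from collections import defaultdict

# (time_s, tier, passed) for receipts 01-26, copied from the Detailed Results table
ROWS = [
    (6.8, "T1", True), (6.0, "T1", True), (5.9, "T1", True), (7.7, "T1", True),
    (11.5, "T1+T2", False), (5.9, "T1", True), (5.1, "T1", True), (4.8, "T1", True),
    (4.9, "T1", True), (5.3, "T1", True), (5.8, "T1", True), (5.7, "T1", True),
    (6.8, "T1", True), (5.7, "T1", True), (4.5, "T1", True), (4.8, "T1", True),
    (11.9, "T1+T2", False), (5.9, "T1", True), (4.6, "T1", True), (12.5, "T1+T2", False),
    (5.6, "T1", True), (15.9, "T1+T2", False), (4.4, "T1", True), (5.8, "T1", True),
    (6.8, "T1", False), (4.8, "T1", True),
]


def summarize(rows):
    """Return the success rate (percent) and, per tier, (count, mean seconds)."""
    by_tier = defaultdict(list)
    for seconds, tier, _passed in rows:
        by_tier[tier].append(seconds)
    success_rate = 100 * sum(passed for _, _, passed in rows) / len(rows)
    tiers = {tier: (len(times), sum(times) / len(times)) for tier, times in by_tier.items()}
    return success_rate, tiers


rate, tiers = summarize(ROWS)
print(f"success rate: {rate:.1f}%")
for tier, (count, mean) in sorted(tiers.items()):
    print(f"{tier}: {count} receipts, mean {mean:.1f}s")
```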

## Failures (5)

| Receipt | Issue | Fixable |
|---------|-------|---------|
| igiena 14 dec | TOTAL ±1 | No |
| brick 8 oct | TOTAL/CUI | Maybe |
| brick 1 sept | ALL None | No (bad doc) |
| brick electrice | DATE None | Maybe |
| brick 604 | DATE ±1 | No |

## Recommendation

```
# Pick one OCR_WORKERS value:
OCR_WORKERS=1               # Best for sequential processing, saves RAM
OCR_WORKERS=2               # For parallel requests (production)
OCR_MAX_TASKS_PER_CHILD=0   # No worker restart
```

**For 8GB RAM:** Use 1-2 workers max
BIN
docs/data-entry/benzina 07 aug. 2024.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/benzina 10 mai 2025.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/benzina 13 iulie.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/benzina 13 septembrie .pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/benzina 20 dec.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/best print stampila .pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/bon fiscal Dedeman - efactura.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/brick consumabil 604 50% deductibil 22 dec.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/brick consumabile 604 22 dec.pdf
Normal file
Binary file not shown.
6740
docs/data-entry/brick igiena 1 sept.pdf
Normal file
File diff suppressed because it is too large
2552
docs/data-entry/brick igiena 604.pdf
Normal file
File diff suppressed because it is too large
2292
docs/data-entry/brick igiena 8 octombrie 98.95 lei card.pdf
Normal file
File diff suppressed because it is too large
2610
docs/data-entry/brick igiena, electrice consumabile 604.pdf
Normal file
File diff suppressed because it is too large
BIN
docs/data-entry/electrobering igiena iulie 604.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/electrobering telecomanda.pdf
Normal file
Binary file not shown.
1370
docs/data-entry/factura 70005116259 20.09.2025 Dedeman.pdf
Normal file
File diff suppressed because it is too large
BIN
docs/data-entry/kineterra abonament terapie august 2024.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/kineterra fizioterapie 9 sept.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/stepout market carti tva 5%.pdf
Normal file
Binary file not shown.
BIN
docs/data-entry/unlimited duplicat chei 23 mai.pdf
Normal file
Binary file not shown.