refactor(docs): create troubleshooting folder, cleanup OCR docs

- Create docs/troubleshooting/ for debugging guides
- Move OCR_MEMORY_SOLUTIONS_RESEARCH.md → troubleshooting/OCR_MEMORY_LEAKS.md
- Delete outdated docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
- Update CLAUDE.md documentation index with troubleshooting reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Claude Agent
2026-01-22 09:21:07 +00:00
parent 625fdcef5e
commit fcf2722974
3 changed files with 1 additions and 116 deletions

@@ -165,6 +165,7 @@ class YourService:
| **Data Entry** | `docs/data-entry/DATA-ENTRY-MODULE.md` | SQLModel, workflow |
| **Telegram** | `docs/telegram/README.md` | Bot development |
| **Deployment** | `deployment/linux/README.md` | Deploy from LXC |
| **Troubleshooting** | `docs/troubleshooting/` | Debugging guides, memory leaks |
---

@@ -1,116 +0,0 @@
# OCR Profile Test Results
**Date**: 2026-01-07
**Test Script**: `scripts/test_all_profiles.py`
**Engine**: doctr_plus
## Summary
| Status | Count |
|--------|-------|
| ✅ Passed | 13 |
| ❌ Failed | 15 |
| ⏭️ Skipped | 0 |
| 💥 Errors | 1 |
| **Total** | **29** |
---
## Passing Tests (13)
1. `abonament kineterra.pdf` - Kineterra
2. `benzina 10 mai 2025.pdf` - OMV
3. `benzina 13 septembrie .pdf` - OMV ✓ (fixed payment)
4. `benzina 14 august.pdf` - OMV
5. `best print stampila .pdf` - Best Print
6. `brick consumabile 604 22 dec.pdf` - Brick ✓ (fixed)
7. `gama ink refill toner imprimanta 17 sept 2024.pdf` - Gama Ink ✓ (fixed)
8. `igiena 11 octombrie .pdf` - Brick ✓ (fixed)
9. `kineterra abonament terapie august 2024.pdf` - Kineterra
10. `kineterra fizioterapie 9 sept.pdf` - Kineterra
11. `Lidl personal 4 ianuarie .pdf` - Lidl
12. `rechizite 12 decembrie pictus.pdf` - Pictus
13. `unlimited duplicat chei 23 mai.pdf` - Unlimited Keys ✓ (fixed)
---
## Failing Tests - Categorized
### Category A: OCR Quality Issues (Cannot Fix)
These failures are due to OCR misreading digits. Common patterns:
- `7` ↔ `2` confusion (1879855 → 1829865)
- `5` ↔ `3` confusion (1879855 → 1853855)
- Off-by-one dates
- Slight amount variations
| File | Issue | Details |
|------|-------|---------|
| `benzina 27 octombrie .pdf` | Client CUI | Missing (OCR didn't capture) |
| `benzina 20 dec.pdf` | Client CUI + Total | CUI: 1853855→1879855, Total variance |
| `bon fiscal Dedeman - efactura.pdf` | Client CUI | 272714→1879855 (completely wrong) |
| `electrobering telecomanda.pdf` | Client CUI | 1829865→1879855 (2/7 confusion) |
| `electrobering igiena iulie 604.pdf` | Client CUI | RO1829865→RO1879855 |
| `benzina 13 iulie.pdf` | Client CUI | Missing (SOCAR) |
| `benzina 07 aug. 2024.pdf` | Multiple | Total/TVA/Date all off - multi-page PDF issue |
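A digit-confusion failure like those in this category could be told apart from an unrelated value with a tolerant comparison. The following is a minimal sketch under assumptions: `CONFUSABLE_PAIRS` and `is_ocr_digit_confusion` are hypothetical names, not part of the project's profiles or test script, and the pair set only covers the two confusions listed above.

```python
# Hypothetical helper for triaging failures; not taken from the project's code.
# Each frozenset holds two digits that the OCR engine is assumed to confuse.
CONFUSABLE_PAIRS = {frozenset("72"), frozenset("53")}


def is_ocr_digit_confusion(expected: str, detected: str) -> bool:
    """Return True if the strings differ only at positions where the
    two characters form a known confusable pair."""
    if len(expected) != len(detected) or expected == detected:
        return False
    return all(
        e == d or frozenset((e, d)) in CONFUSABLE_PAIRS
        for e, d in zip(expected, detected)
    )


# A single 7/2 swap is classified as an OCR confusion.
print(is_ocr_digit_confusion("1879855", "1829855"))
# A value of different length is not.
print(is_ocr_digit_confusion("1879855", "272714"))
```

With only these two pairs, the helper rejects `1829865` against `1879855`, because that value also differs at a `5`/`6` position; the pair set would need extending to cover such cases.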
### Category B: PDF Quality/Structure Issues
| File | Issue | Details |
|------|-------|---------|
| `brick igiena 1 sept.pdf` | All fields missing | PDF likely corrupted or low quality |
| `brick igiena, electrice consumabile 604.pdf` | Decimal point | 19060.0 vs 190.6 - OCR misread decimal |
| `stepout market carti tva 5%.pdf` | Timeout | OCR taking too long (duplicate receipt in PDF) |
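The decimal-point failure (19060.0 vs 190.6) has a recognizable signature: both values share the same significant digits and differ only in where the decimal point sits. A minimal sketch of such a check, assuming amounts are available as strings; `is_decimal_shift` is a hypothetical name and not part of the test script:

```python
from decimal import Decimal


def is_decimal_shift(expected: str, detected: str) -> bool:
    """Return True if the two amounts have identical significant digits
    but different values, i.e. only the decimal point moved."""
    exp = Decimal(expected).normalize()
    det = Decimal(detected).normalize()
    if exp == det:
        return False
    # normalize() strips trailing zeros, so 19060.0 and 190.6 both
    # reduce to the digit tuple (1, 9, 0, 6).
    return exp.as_tuple().digits == det.as_tuple().digits


print(is_decimal_shift("190.6", "19060.0"))  # same digits, shifted point
print(is_decimal_shift("85.99", "86.99"))    # genuinely different amount
```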
### Category C: Expected Values May Need Update
| File | Issue | Details |
|------|-------|---------|
| `igiena 14 decembrie five-holding.pdf` | Total off by 1.00 | 86.99 vs 85.99 - check expected value |
| `Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf` | TVA off by 1.00 | 5.38 vs 6.38 - check expected value |
| `factura 70005116259 Dedeman.pdf` | Client CUI | Different buyer CUI (46598884 vs 1879855) |
### Category D: Wrong Store Detected
| File | Issue | Details |
|------|-------|---------|
| `brick igiena 8 octombrie 98.95 lei card.pdf` | Wrong CUI | Detected RO10604500, expected RO10562600. Different store on receipt? |
### Category E: Profile Patterns Still Missing
| File | Issue | Needed Fix |
|------|-------|------------|
| `brick igiena 604.pdf` | TVA not extracted | Different TVA format in this receipt |
| `brick consumabil 604 50% deductibil 22 dec.pdf` | Client CUI missing | OCR pattern not matching |
| `factura Dedeman.pdf` | TVA not extracted | Invoice format different from fiscal receipt |
---
## Profiles Updated
| Profile | Changes Made |
|---------|--------------|
| `brick.py` | Added client CUI, multiline TVA, CARD payment detection |
| `electrobering.py` | Added multiline TVA with double-dash handling |
| `stepout_market.py` | Complete rewrite for multiline format |
| `gama_ink.py` | Added multiline TVA, OCR "4" → "-" handling |
| `omv.py` | Added "CARTE CREDIT" payment detection |
| `socar.py` | Added "CARTE CREDIT" payment detection |
| `unlimited_keys.py` | (Previously fixed) TUA, NUMERAR, client CUI |
---
## Recommendations
1. **expected_receipts.json Update**: Some expected values may need verification:
   - Check if `igiena 14 decembrie` total is really 85.99 or 86.99
   - Check if `Lidl papetarie` TVA is really 6.38 or 5.38
   - Verify `factura Dedeman` client CUI (different buyer)
2. **Low-Quality PDFs**: Consider replacing:
   - `brick igiena 1 sept.pdf` - appears corrupted
   - `brick igiena, electrice consumabile 604.pdf` - decimal point issue
3. **Acceptance Criteria**: For OCR-based extraction, ~80% accuracy is typical.
   - Current rate: 13/29 = 44.8% (with strict matching)
   - If excluding OCR quality issues: 13/20 = 65% (profile issues)
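The two rates quoted in the acceptance criteria can be reproduced with a short calculation. This is only an illustrative sketch: `pass_rate` is a hypothetical helper, and the counts (13 of 29, and 13 of 20) are taken from the report as written.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


# Strict matching over all test files.
print(pass_rate(13, 29))  # 44.8
# Denominator used by the report after excluding OCR quality issues.
print(pass_rate(13, 20))  # 65.0
```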