From fcf272297414cdf7649cddab4b2b3b86a4193497 Mon Sep 17 00:00:00 2001
From: Claude Agent
Date: Thu, 22 Jan 2026 09:21:07 +0000
Subject: [PATCH] refactor(docs): create troubleshooting folder, cleanup OCR docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Create docs/troubleshooting/ for debugging guides
- Move OCR_MEMORY_SOLUTIONS_RESEARCH.md → troubleshooting/OCR_MEMORY_LEAKS.md
- Delete outdated docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
- Update CLAUDE.md documentation index with troubleshooting reference

Co-Authored-By: Claude Opus 4.5
---
 CLAUDE.md                                   |   1 +
 docs/data-entry/OCR_PROFILE_TEST_RESULTS.md | 116 ------------------
 .../OCR_MEMORY_LEAKS.md}                    |   0
 3 files changed, 1 insertion(+), 116 deletions(-)
 delete mode 100644 docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
 rename docs/{OCR_MEMORY_SOLUTIONS_RESEARCH.md => troubleshooting/OCR_MEMORY_LEAKS.md} (100%)

diff --git a/CLAUDE.md b/CLAUDE.md
index 097d0c2..2df4a52 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -165,6 +165,7 @@ class YourService:
 | **Data Entry** | `docs/data-entry/DATA-ENTRY-MODULE.md` | SQLModel, workflow |
 | **Telegram** | `docs/telegram/README.md` | Bot development |
 | **Deployment** | `deployment/linux/README.md` | Deploy from LXC |
+| **Troubleshooting** | `docs/troubleshooting/` | Debugging guides, memory leaks |
 
 ---
 
diff --git a/docs/data-entry/OCR_PROFILE_TEST_RESULTS.md b/docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
deleted file mode 100644
index fa18926..0000000
--- a/docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
+++ /dev/null
@@ -1,116 +0,0 @@
-# OCR Profile Test Results
-
-**Date**: 2026-01-07
-**Test Script**: `scripts/test_all_profiles.py`
-**Engine**: doctr_plus
-
-## Summary
-
-| Status | Count |
-|--------|-------|
-| ✅ Passed | 13 |
-| ❌ Failed | 15 |
-| ⏭️ Skipped | 0 |
-| 💥 Errors | 1 |
-| **Total** | **29** |
-
----
-
-## Passing Tests (13)
-
-1. `abonament kineterra.pdf` - Kineterra
-2. `benzina 10 mai 2025.pdf` - OMV
-3. `benzina 13 septembrie .pdf` - OMV ✓ (fixed payment)
-4. `benzina 14 august.pdf` - OMV
-5. `best print stampila .pdf` - Best Print
-6. `brick consumabile 604 22 dec.pdf` - Brick ✓ (fixed)
-7. `gama ink refill toner imprimanta 17 sept 2024.pdf` - Gama Ink ✓ (fixed)
-8. `igiena 11 octombrie .pdf` - Brick ✓ (fixed)
-9. `kineterra abonament terapie august 2024.pdf` - Kineterra
-10. `kineterra fizioterapie 9 sept.pdf` - Kineterra
-11. `Lidl personal 4 ianuarie .pdf` - Lidl
-12. `rechizite 12 decembrie pictus.pdf` - Pictus
-13. `unlimited duplicat chei 23 mai.pdf` - Unlimited Keys ✓ (fixed)
-
----
-
-## Failing Tests - Categorized
-
-### Category A: OCR Quality Issues (Cannot Fix)
-
-These failures are due to OCR misreading digits. Common patterns:
-- `7` ↔ `2` confusion (1879855 → 1829865)
-- `5` ↔ `3` confusion (1879855 → 1853855)
-- Off-by-one dates
-- Slight amount variations
-
-| File | Issue | Details |
-|------|-------|---------|
-| `benzina 27 octombrie .pdf` | Client CUI | Missing (OCR didn't capture) |
-| `benzina 20 dec.pdf` | Client CUI + Total | CUI: 1853855→1879855, Total variance |
-| `bon fiscal Dedeman - efactura.pdf` | Client CUI | 272714→1879855 (completely wrong) |
-| `electrobering telecomanda.pdf` | Client CUI | 1829865→1879855 (2/7 confusion) |
-| `electrobering igiena iulie 604.pdf` | Client CUI | RO1829865→RO1879855 |
-| `benzina 13 iulie.pdf` | Client CUI | Missing (SOCAR) |
-| `benzina 07 aug. 2024.pdf` | Multiple | Total/TVA/Date all off - multi-page PDF issue |
-
-### Category B: PDF Quality/Structure Issues
-
-| File | Issue | Details |
-|------|-------|---------|
-| `brick igiena 1 sept.pdf` | All fields missing | PDF likely corrupted or low quality |
-| `brick igiena, electrice consumabile 604.pdf` | Decimal point | 19060.0 vs 190.6 - OCR misread decimal |
-| `stepout market carti tva 5%.pdf` | Timeout | OCR taking too long (duplicate receipt in PDF) |
-
-### Category C: Expected Values May Need Update
-
-| File | Issue | Details |
-|------|-------|---------|
-| `igiena 14 decembrie five-holding.pdf` | Total off by 1.00 | 86.99 vs 85.99 - check expected value |
-| `Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf` | TVA off by 1.00 | 5.38 vs 6.38 - check expected value |
-| `factura 70005116259 Dedeman.pdf` | Client CUI | Different buyer CUI (46598884 vs 1879855) |
-
-### Category D: Wrong Store Detected
-
-| File | Issue | Details |
-|------|-------|---------|
-| `brick igiena 8 octombrie 98.95 lei card.pdf` | Wrong CUI | Detected RO10604500, expected RO10562600. Different store on receipt? |
-
-### Category E: Profile Patterns Still Missing
-
-| File | Issue | Needed Fix |
-|------|-------|------------|
-| `brick igiena 604.pdf` | TVA not extracted | Different TVA format in this receipt |
-| `brick consumabil 604 50% deductibil 22 dec.pdf` | Client CUI missing | OCR pattern not matching |
-| `factura Dedeman.pdf` | TVA not extracted | Invoice format different from fiscal receipt |
-
----
-
-## Profiles Updated
-
-| Profile | Changes Made |
-|---------|--------------|
-| `brick.py` | Added client CUI, multiline TVA, CARD payment detection |
-| `electrobering.py` | Added multiline TVA with double-dash handling |
-| `stepout_market.py` | Complete rewrite for multiline format |
-| `gama_ink.py` | Added multiline TVA, OCR "4" → "-" handling |
-| `omv.py` | Added "CARTE CREDIT" payment detection |
-| `socar.py` | Added "CARTE CREDIT" payment detection |
-| `unlimited_keys.py` | (Previously fixed) TUA, NUMERAR, client CUI |
-
----
-
-## Recommendations
-
-1. **expected_receipts.json Update**: Some expected values may need verification:
-   - Check if `igiena 14 decembrie` total is really 85.99 or 86.99
-   - Check if `Lidl papetarie` TVA is really 6.38 or 5.38
-   - Verify `factura Dedeman` client CUI (different buyer)
-
-2. **Low-Quality PDFs**: Consider replacing:
-   - `brick igiena 1 sept.pdf` - appears corrupted
-   - `brick igiena, electrice consumabile 604.pdf` - decimal point issue
-
-3. **Acceptance Criteria**: For OCR-based extraction, ~80% accuracy is typical.
-   Current rate: 13/29 = 44.8% (with strict matching)
-   If excluding OCR quality issues: 13/20 = 65% (profile issues)
diff --git a/docs/OCR_MEMORY_SOLUTIONS_RESEARCH.md b/docs/troubleshooting/OCR_MEMORY_LEAKS.md
similarity index 100%
rename from docs/OCR_MEMORY_SOLUTIONS_RESEARCH.md
rename to docs/troubleshooting/OCR_MEMORY_LEAKS.md