refactor(docs): create troubleshooting folder, cleanup OCR docs
- Create docs/troubleshooting/ for debugging guides
- Move OCR_MEMORY_SOLUTIONS_RESEARCH.md → troubleshooting/OCR_MEMORY_LEAKS.md
- Delete outdated docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
- Update CLAUDE.md documentation index with troubleshooting reference

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,116 +0,0 @@
# OCR Profile Test Results

**Date**: 2026-01-07
**Test Script**: `scripts/test_all_profiles.py`
**Engine**: doctr_plus

## Summary

| Status | Count |
|--------|-------|
| ✅ Passed | 13 |
| ❌ Failed | 15 |
| ⏭️ Skipped | 0 |
| 💥 Errors | 1 |
| **Total** | **29** |

---

## Passing Tests (13)

1. `abonament kineterra.pdf` - Kineterra
2. `benzina 10 mai 2025.pdf` - OMV
3. `benzina 13 septembrie .pdf` - OMV ✓ (fixed payment)
4. `benzina 14 august.pdf` - OMV
5. `best print stampila .pdf` - Best Print
6. `brick consumabile 604 22 dec.pdf` - Brick ✓ (fixed)
7. `gama ink refill toner imprimanta 17 sept 2024.pdf` - Gama Ink ✓ (fixed)
8. `igiena 11 octombrie .pdf` - Brick ✓ (fixed)
9. `kineterra abonament terapie august 2024.pdf` - Kineterra
10. `kineterra fizioterapie 9 sept.pdf` - Kineterra
11. `Lidl personal 4 ianuarie .pdf` - Lidl
12. `rechizite 12 decembrie pictus.pdf` - Pictus
13. `unlimited duplicat chei 23 mai.pdf` - Unlimited Keys ✓ (fixed)

---

## Failing Tests - Categorized

### Category A: OCR Quality Issues (Cannot Fix)

These failures are due to OCR misreading digits. Common patterns:

- `7` ↔ `2` confusion (1879855 → 1829865)
- `5` ↔ `3` confusion (1879855 → 1853855)
- Off-by-one dates
- Slight amount variations

| File | Issue | Details |
|------|-------|---------|
| `benzina 27 octombrie .pdf` | Client CUI | Missing (OCR didn't capture) |
| `benzina 20 dec.pdf` | Client CUI + Total | CUI: 1853855→1879855, Total variance |
| `bon fiscal Dedeman - efactura.pdf` | Client CUI | 272714→1879855 (completely wrong) |
| `electrobering telecomanda.pdf` | Client CUI | 1829865→1879855 (2/7 confusion) |
| `electrobering igiena iulie 604.pdf` | Client CUI | RO1829865→RO1879855 |
| `benzina 13 iulie.pdf` | Client CUI | Missing (SOCAR) |
| `benzina 07 aug. 2024.pdf` | Multiple | Total/TVA/Date all off - multi-page PDF issue |

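To make the digit-confusion pattern in Category A concrete, here is a minimal sketch that lists the positions where a detected CUI differs from the expected one. It assumes the table writes values as detected→expected and that a leading `RO` prefix can be ignored; the helper names `_strip_ro` and `digit_mismatches` are hypothetical and not part of `scripts/test_all_profiles.py`.

```python
def _strip_ro(cui: str) -> str:
    """Drop an optional leading "RO" country prefix and surrounding spaces."""
    cui = cui.strip().upper()
    return cui[2:] if cui.startswith("RO") else cui


def digit_mismatches(detected: str, expected: str):
    """Return (index, detected_digit, expected_digit) for each differing position.

    Returns None when the lengths differ, because that is not a plain
    digit-for-digit substitution (for example 272714 vs 1879855).
    """
    d, e = _strip_ro(detected), _strip_ro(expected)
    if len(d) != len(e):
        return None
    return [(i, a, b) for i, (a, b) in enumerate(zip(d, e)) if a != b]


# Values taken from the Category A table (assumed detected -> expected).
print(digit_mismatches("1829865", "1879855"))  # [(2, '2', '7'), (5, '6', '5')]
print(digit_mismatches("272714", "1879855"))   # None
```

A short mismatch list (one or two digits) suggests an OCR misread, while `None` or a long list points to a genuinely different value.
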
### Category B: PDF Quality/Structure Issues

| File | Issue | Details |
|------|-------|---------|
| `brick igiena 1 sept.pdf` | All fields missing | PDF likely corrupted or low quality |
| `brick igiena, electrice consumabile 604.pdf` | Decimal point | 19060.0 vs 190.6 - OCR misread decimal |
| `stepout market carti tva 5%.pdf` | Timeout | OCR taking too long (duplicate receipt in PDF) |

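The decimal-point failure (19060.0 vs 190.6) differs from the expected amount by an exact power of ten, which can be checked mechanically. The following is a sketch only: `decimal_shift` is a hypothetical helper, and the `max_shift` and `rel_tol` defaults are assumptions rather than settings from the test script.

```python
import math


def decimal_shift(detected: float, expected: float,
                  max_shift: int = 4, rel_tol: float = 1e-9):
    """Return k if detected is approximately expected * 10**k, else None.

    Only non-zero shifts with |k| <= max_shift are considered, so equal
    values and unrelated values both yield None.
    """
    if detected <= 0 or expected <= 0:
        return None
    for k in range(-max_shift, max_shift + 1):
        if k != 0 and math.isclose(detected, expected * 10.0 ** k, rel_tol=rel_tol):
            return k
    return None


# Amounts from the Category B table: the decimal point moved two places.
print(decimal_shift(19060.0, 190.6))  # 2
# An off-by-1.00 total is not a decimal shift.
print(decimal_shift(86.99, 85.99))    # None
```

A non-`None` result identifies a misplaced decimal point rather than a misread digit, which could justify a separate failure label in the report.
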
### Category C: Expected Values May Need Update

| File | Issue | Details |
|------|-------|---------|
| `igiena 14 decembrie five-holding.pdf` | Total off by 1.00 | 86.99 vs 85.99 - check expected value |
| `Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf` | TVA off by 1.00 | 5.38 vs 6.38 - check expected value |
| `factura 70005116259 Dedeman.pdf` | Client CUI | Different buyer CUI (46598884 vs 1879855) |

### Category D: Wrong Store Detected

| File | Issue | Details |
|------|-------|---------|
| `brick igiena 8 octombrie 98.95 lei card.pdf` | Wrong CUI | Detected RO10604500, expected RO10562600. Different store on receipt? |

### Category E: Profile Patterns Still Missing

| File | Issue | Needed Fix |
|------|-------|------------|
| `brick igiena 604.pdf` | TVA not extracted | Different TVA format in this receipt |
| `brick consumabil 604 50% deductibil 22 dec.pdf` | Client CUI missing | OCR pattern not matching |
| `factura Dedeman.pdf` | TVA not extracted | Invoice format different from fiscal receipt |

---

## Profiles Updated

| Profile | Changes Made |
|---------|--------------|
| `brick.py` | Added client CUI, multiline TVA, CARD payment detection |
| `electrobering.py` | Added multiline TVA with double-dash handling |
| `stepout_market.py` | Complete rewrite for multiline format |
| `gama_ink.py` | Added multiline TVA, OCR "4" → "-" handling |
| `omv.py` | Added "CARTE CREDIT" payment detection |
| `socar.py` | Added "CARTE CREDIT" payment detection |
| `unlimited_keys.py` | (Previously fixed) TUA, NUMERAR, client CUI |

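The payment-detection changes listed above are not shown in this report. As a rough illustration of what keyword-based detection for "CARTE CREDIT", "CARD", and "NUMERAR" might look like, here is a sketch. The function `detect_payment`, the `card`/`cash` labels, and the exact regular expressions are assumptions, not the code in the profile modules.

```python
import re
from typing import Optional

# Hypothetical keyword patterns; the real profiles may match differently.
_PAYMENT_PATTERNS = [
    ("card", re.compile(r"\bCARTE\s+CREDIT\b|\bCARD\b", re.IGNORECASE)),
    ("cash", re.compile(r"\bNUMERAR\b", re.IGNORECASE)),
]


def detect_payment(ocr_text: str) -> Optional[str]:
    """Return "card" or "cash" for the first matching pattern, else None."""
    for label, pattern in _PAYMENT_PATTERNS:
        if pattern.search(ocr_text):
            return label
    return None


print(detect_payment("TOTAL 98.95\nCARTE CREDIT 98.95"))  # card
print(detect_payment("NUMERAR 10.00"))                    # cash
```

Using `\s+` between the two words tolerates OCR output that splits "CARTE CREDIT" across extra spaces or a line break.
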
---

## Recommendations

1. **expected_receipts.json Update**: Some expected values may need verification:
   - Check if `igiena 14 decembrie` total is really 85.99 or 86.99
   - Check if `Lidl papetarie` TVA is really 6.38 or 5.38
   - Verify `factura Dedeman` client CUI (different buyer)

2. **Low-Quality PDFs**: Consider replacing:
   - `brick igiena 1 sept.pdf` - appears corrupted
   - `brick igiena, electrice consumabile 604.pdf` - decimal point issue

3. **Acceptance Criteria**: For OCR-based extraction, ~80% accuracy is typical.
   - Current rate: 13/29 = 44.8% (with strict matching)
   - If excluding OCR quality issues: 13/20 = 65% (profile issues)

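The two rates quoted under Acceptance Criteria are plain ratios of passed tests to tests counted. A small sketch of the arithmetic, using the counts exactly as stated in the report (`pass_rate` is a hypothetical helper name, and the denominator of 20 is taken from the report as given):

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


print(pass_rate(13, 29))  # 44.8
print(pass_rate(13, 20))  # 65.0
```
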