# OCR Profile Test Results

**Date**: 2026-01-07
**Test Script**: `scripts/test_all_profiles.py`
**Engine**: doctr_plus

## Summary

| Status | Count |
|--------|-------|
| ✅ Passed | 13 |
| ❌ Failed | 15 |
| ⏭️ Skipped | 0 |
| 💥 Errors | 1 |
| **Total** | **29** |

---

## Passing Tests (13)

1. `abonament kineterra.pdf` - Kineterra
2. `benzina 10 mai 2025.pdf` - OMV
3. `benzina 13 septembrie .pdf` - OMV ✓ (fixed payment)
4. `benzina 14 august.pdf` - OMV
5. `best print stampila .pdf` - Best Print
6. `brick consumabile 604 22 dec.pdf` - Brick ✓ (fixed)
7. `gama ink refill toner imprimanta 17 sept 2024.pdf` - Gama Ink ✓ (fixed)
8. `igiena 11 octombrie .pdf` - Brick ✓ (fixed)
9. `kineterra abonament terapie august 2024.pdf` - Kineterra
10. `kineterra fizioterapie 9 sept.pdf` - Kineterra
11. `Lidl personal 4 ianuarie .pdf` - Lidl
12. `rechizite 12 decembrie pictus.pdf` - Pictus
13. `unlimited duplicat chei 23 mai.pdf` - Unlimited Keys ✓ (fixed)

---

## Failing Tests - Categorized

### Category A: OCR Quality Issues (Cannot Fix)

These failures are caused by the OCR engine misreading digits. Common patterns:

- `7` ↔ `2` confusion (1879855 read as 1829865)
- `5` ↔ `3` confusion (1879855 read as 1853855)
- Off-by-one dates
- Slight amount variations

In the Details column, `a→b` means the OCR returned `a` where `b` was expected.

| File | Issue | Details |
|------|-------|---------|
| `benzina 27 octombrie .pdf` | Client CUI | Missing (OCR didn't capture it) |
| `benzina 20 dec.pdf` | Client CUI + Total | CUI: 1853855→1879855, Total variance |
| `bon fiscal Dedeman - efactura.pdf` | Client CUI | 272714→1879855 (completely wrong) |
| `electrobering telecomanda.pdf` | Client CUI | 1829865→1879855 (2/7 confusion) |
| `electrobering igiena iulie 604.pdf` | Client CUI | RO1829865→RO1879855 |
| `benzina 13 iulie.pdf` | Client CUI | Missing (SOCAR) |
| `benzina 07 aug. 2024.pdf` | Multiple | Total/TVA/Date all off - multi-page PDF issue |

### Category B: PDF Quality/Structure Issues

| File | Issue | Details |
|------|-------|---------|
| `brick igiena 1 sept.pdf` | All fields missing | PDF likely corrupted or low quality |
| `brick igiena, electrice consumabile 604.pdf` | Decimal point | 19060.0 vs 190.6 - OCR misread the decimal point |
| `stepout market carti tva 5%.pdf` | Timeout | OCR taking too long (duplicate receipt in PDF) |

### Category C: Expected Values May Need Update

| File | Issue | Details |
|------|-------|---------|
| `igiena 14 decembrie five-holding.pdf` | Total off by 1.00 | 86.99 vs 85.99 - check expected value |
| `Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf` | TVA off by 1.00 | 5.38 vs 6.38 - check expected value |
| `factura 70005116259 Dedeman.pdf` | Client CUI | Different buyer CUI (46598884 vs 1879855) |

### Category D: Wrong Store Detected

| File | Issue | Details |
|------|-------|---------|
| `brick igiena 8 octombrie 98.95 lei card.pdf` | Wrong CUI | Detected RO10604500, expected RO10562600. Different store on receipt? |

### Category E: Profile Patterns Still Missing

| File | Issue | Needed Fix |
|------|-------|------------|
| `brick igiena 604.pdf` | TVA not extracted | Different TVA format in this receipt |
| `brick consumabil 604 50% deductibil 22 dec.pdf` | Client CUI missing | OCR pattern not matching |
| `factura Dedeman.pdf` | TVA not extracted | Invoice format differs from fiscal receipt |

---

## Profiles Updated

| Profile | Changes Made |
|---------|--------------|
| `brick.py` | Added client CUI, multiline TVA, CARD payment detection |
| `electrobering.py` | Added multiline TVA with double-dash handling |
| `stepout_market.py` | Complete rewrite for multiline format |
| `gama_ink.py` | Added multiline TVA, OCR "4" → "-" handling |
| `omv.py` | Added "CARTE CREDIT" payment detection |
| `socar.py` | Added "CARTE CREDIT" payment detection |
| `unlimited_keys.py` | (Previously fixed) TUA, NUMERAR, client CUI |

---

## Recommendations

1. **`expected_receipts.json` Update**: Some expected values need verification:
   - Check whether the `igiena 14 decembrie` total is really 85.99 or 86.99
   - Check whether the `Lidl papetarie` TVA is really 6.38 or 5.38
   - Verify the `factura Dedeman` client CUI (different buyer)

2. **Low-Quality PDFs**: Consider replacing:
   - `brick igiena 1 sept.pdf` - appears corrupted
   - `brick igiena, electrice consumabile 604.pdf` - decimal point issue

3. **Acceptance Criteria**: For OCR-based extraction, ~80% accuracy is typical.
   - Current rate: 13/29 = 44.8% (with strict matching)
   - If excluding OCR quality issues: 13/20 = 65% (profile issues)
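
The rates quoted under Acceptance Criteria, and the digit-confusion patterns from Category A, can be reproduced with a few lines of Python. This is only a minimal sketch: `pass_rate` and `digit_mismatches` are hypothetical helper names that are not part of `scripts/test_all_profiles.py`, and the counts passed in are simply the figures quoted in this report.

```python
from typing import Optional


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * passed / total, 1)


def digit_mismatches(expected: str, detected: str) -> Optional[int]:
    """Count differing positions between two equal-length digit strings.

    Returns None when the lengths differ, because a position-by-position
    comparison is not meaningful in that case.
    """
    if len(expected) != len(detected):
        return None
    return sum(1 for e, d in zip(expected, detected) if e != d)


# Rates from the Acceptance Criteria item (counts taken from this report).
print(pass_rate(13, 29))  # strict matching over all 29 files
print(pass_rate(13, 20))  # reduced set quoted in the report

# Near-miss CUI readings from Category A differ in only a few positions,
# while a reading of a different length cannot be compared this way.
print(digit_mismatches("1879855", "1829865"))
print(digit_mismatches("1879855", "272714"))
```

A small mismatch count on an equal-length CUI is one possible signal for filing a failure under OCR quality issues rather than under missing profile patterns. Whether any such threshold is appropriate is an assumption that would need checking against the receipts themselves.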