roa2web-service-auto/docs/data-entry/OCR_PROFILE_TEST_RESULTS.md
Claude Agent 28f259cd05 fix(ocr): Fix store profile extraction patterns and module loading
Major fixes to OCR store profiles for Romanian receipt extraction:

- Fix ProfileRegistry module path resolution (was loading 0 profiles)
- Add multiline TVA extraction for Brick, Electrobering, Gama Ink
- Add "CARTE CREDIT" payment detection for OMV/SOCAR gas stations
- Handle OCR artifacts: TVA→TUA, "-"→"4", I→L in CUI markers
- Add client CUI patterns for Brick receipts
- Add profile selection logging to ocr_extractor.py
- Create test script for all 29 PDFs (test_all_profiles.py)

Test results: 13/29 passing (improved from 9/29)
Remaining failures are primarily OCR quality issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-07 09:40:58 +00:00


OCR Profile Test Results

Date: 2026-01-07
Test Script: scripts/test_all_profiles.py
Engine: doctr_plus

Summary

| Status | Count |
|---|---|
| Passed | 13 |
| Failed | 15 |
| ⏭️ Skipped | 0 |
| 💥 Errors | 1 |
| Total | 29 |

Passing Tests (13)

  1. abonament kineterra.pdf - Kineterra
  2. benzina 10 mai 2025.pdf - OMV
  3. benzina 13 septembrie .pdf - OMV ✓ (fixed payment)
  4. benzina 14 august.pdf - OMV
  5. best print stampila .pdf - Best Print
  6. brick consumabile 604 22 dec.pdf - Brick ✓ (fixed)
  7. gama ink refill toner imprimanta 17 sept 2024.pdf - Gama Ink ✓ (fixed)
  8. igiena 11 octombrie .pdf - Brick ✓ (fixed)
  9. kineterra abonament terapie august 2024.pdf - Kineterra
  10. kineterra fizioterapie 9 sept.pdf - Kineterra
  11. Lidl personal 4 ianuarie .pdf - Lidl
  12. rechizite 12 decembrie pictus.pdf - Pictus
  13. unlimited duplicat chei 23 mai.pdf - Unlimited Keys ✓ (fixed)

Failing Tests - Categorized

Category A: OCR Quality Issues (Cannot Fix)

These failures are due to OCR misreading digits. Common patterns:

  • 7/2 confusion (expected 1879855, read as 1829865)
  • 79/53 confusion (expected 1879855, read as 1853855)
  • Off-by-one dates
  • Slight amount variations

| File | Issue | Details |
|---|---|---|
| benzina 27 octombrie .pdf | Client CUI | Missing (OCR didn't capture) |
| benzina 20 dec.pdf | Client CUI + Total | CUI: 1853855→1879855, Total variance |
| bon fiscal Dedeman - efactura.pdf | Client CUI | 272714→1879855 (completely wrong) |
| electrobering telecomanda.pdf | Client CUI | 1829865→1879855 (2/7 confusion) |
| electrobering igiena iulie 604.pdf | Client CUI | RO1829865→RO1879855 |
| benzina 13 iulie.pdf | Client CUI | Missing (SOCAR) |
| benzina 07 aug. 2024.pdf | Multiple | Total/TVA/Date all off - multi-page PDF issue |
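
A minimal sketch of how the Category A mismatches could be triaged automatically, so that near-miss digit confusions are separated from missing or unrelated values. The function name `classify_cui_mismatch` and the two-digit threshold are assumptions made for illustration; they are not part of test_all_profiles.py.

```python
def classify_cui_mismatch(ocr_value, expected, max_digit_errors=2):
    """Classify how an OCR-read CUI differs from the expected CUI.

    Hypothetical helper: the labels and the threshold are illustrative.
    """
    def digits(value):
        # Drop the optional "RO" prefix and any other non-digit characters.
        return "".join(ch for ch in (value or "") if ch.isdigit())

    got, want = digits(ocr_value), digits(expected)
    if not got:
        return "missing"
    if got == want:
        return "match"
    if len(got) == len(want):
        # Count positions where the digits disagree (Hamming distance).
        diffs = sum(1 for a, b in zip(got, want) if a != b)
        if diffs <= max_digit_errors:
            return "digit_confusion"
    return "different"


for ocr in ("1829865", "RO1829865", "272714", None):
    print(ocr, "->", classify_cui_mismatch(ocr, "1879855"))
```

Values that differ in length, such as 272714, fall through to "different" and probably deserve a manual look rather than being written off as OCR noise.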

Category B: PDF Quality/Structure Issues

| File | Issue | Details |
|---|---|---|
| brick igiena 1 sept.pdf | All fields missing | PDF likely corrupted or low quality |
| brick igiena, electrice consumabile 604.pdf | Decimal point | 19060.0 vs 190.6 - OCR misread decimal |
| stepout market carti tva 5%.pdf | Timeout | OCR taking too long (duplicate receipt in PDF) |
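
The decimal-point failure (19060.0 vs 190.6) differs from the expected value by an exact power of ten, which makes it easy to tell apart from ordinary amount drift. The check below is a sketch only; `looks_like_decimal_shift` is a hypothetical helper, not something the test script is known to contain.

```python
import math


def looks_like_decimal_shift(ocr_amount, expected_amount, rel_tol=1e-6):
    """Return True when the two amounts differ by an exact power of ten.

    Hypothetical helper for flagging a misplaced or dropped decimal point.
    """
    if ocr_amount <= 0 or expected_amount <= 0:
        return False
    ratio = ocr_amount / expected_amount
    # Nearest whole power of ten; 0 means the amounts have the same scale.
    exponent = round(math.log10(ratio))
    if exponent == 0:
        return False
    return math.isclose(ratio, 10.0 ** exponent, rel_tol=rel_tol)


print(looks_like_decimal_shift(19060.0, 190.6))  # decimal-point case from the table
print(looks_like_decimal_shift(86.99, 85.99))    # off-by-1.00 case, not a shift
```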

Category C: Expected Values May Need Update

| File | Issue | Details |
|---|---|---|
| igiena 14 decembrie five-holding.pdf | Total off by 1.00 | 86.99 vs 85.99 - check expected value |
| Lidl papetarie 604 fara TVA. nu are cod fiscal.pdf | TVA off by 1.00 | 5.38 vs 6.38 - check expected value |
| factura 70005116259 Dedeman.pdf | Client CUI | Different buyer CUI (46598884 vs 1879855) |

Category D: Wrong Store Detected

| File | Issue | Details |
|---|---|---|
| brick igiena 8 octombrie 98.95 lei card.pdf | Wrong CUI | Detected RO10604500, expected RO10562600. Different store on receipt? |

Category E: Profile Patterns Still Missing

| File | Issue | Needed Fix |
|---|---|---|
| brick igiena 604.pdf | TVA not extracted | Different TVA format in this receipt |
| brick consumabil 604 50% deductibil 22 dec.pdf | Client CUI missing | OCR pattern not matching |
| factura Dedeman.pdf | TVA not extracted | Invoice format different from fiscal receipt |

Profiles Updated

| Profile | Changes Made |
|---|---|
| brick.py | Added client CUI, multiline TVA, CARD payment detection |
| electrobering.py | Added multiline TVA with double-dash handling |
| stepout_market.py | Complete rewrite for multiline format |
| gama_ink.py | Added multiline TVA, OCR "4" → "-" handling |
| omv.py | Added "CARTE CREDIT" payment detection |
| socar.py | Added "CARTE CREDIT" payment detection |
| unlimited_keys.py | (Previously fixed) TUA, NUMERAR, client CUI |
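
To illustrate what "multiline TVA" extraction with the TVA→TUA artifact might look like, here is a minimal sketch. The receipt layout (label on one line, amount on the next), the pattern name `TVA_MULTILINE`, and the function `extract_tva` are all assumptions for illustration; the actual patterns in the profile files may differ.

```python
import re

# Assumed layout: the TVA label and its amount sit on separate lines, and
# OCR may read "TVA" as "TUA". The pattern is illustrative, not the real one.
TVA_MULTILINE = re.compile(
    r"\bT[VU]A\b[^\n]*\n\s*(\d+[.,]\d{2})\b",
    re.IGNORECASE,
)


def extract_tva(text):
    """Return the TVA amount as a float, or None when no match is found."""
    match = TVA_MULTILINE.search(text)
    if match is None:
        return None
    # Romanian receipts commonly use a decimal comma; normalise it.
    return float(match.group(1).replace(",", "."))


sample = "TOTAL TUA A 19%\n  15,80\nTOTAL 98,95"
print(extract_tva(sample))
```

Accepting both `V` and `U` in the label keeps the artifact handling inside the pattern, so no separate text-cleanup pass is needed for this one case.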

Recommendations

  1. expected_receipts.json Update: Some expected values may need verification:

    • Check if igiena 14 decembrie total is really 85.99 or 86.99
    • Check if Lidl papetarie TVA is really 6.38 or 5.38
    • Verify factura Dedeman client CUI (different buyer)
  2. Low-Quality PDFs: Consider replacing:

    • brick igiena 1 sept.pdf - appears corrupted
    • brick igiena, electrice consumabile 604.pdf - decimal point issue
  3. Acceptance Criteria: For OCR-based extraction, ~80% accuracy is typical.

    • Current rate with strict matching: 13/29 = 44.8%
    • Excluding OCR quality issues: 13/20 = 65% (profile issues)
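
The quoted rates can be reproduced with a few lines of arithmetic. This is a sketch; `pass_rate` is a hypothetical helper, and the counts are taken from this report and from the commit message (9/29 before the fixes).

```python
def pass_rate(passed, total):
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


print(pass_rate(13, 29))  # strict matching over all 29 PDFs
print(pass_rate(13, 20))  # after excluding 29 - 20 = 9 OCR-quality failures
print(pass_rate(9, 29))   # rate before this commit
```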