feat(ocr): Add modular store profiles with hot-reload support
## Store Profiles System
- Add ProfileRegistry for CUI-based profile lookup
- Add BaseStoreProfile with generic extraction patterns
- Implement hot-reload via POST /api/data-entry/ocr/profiles/reload
## 12 Store Profiles
- LIDL: Multi-rate TVA (A, B, C, D codes)
- OMV, SOCAR: B2B with client CUI, YYYY.MM.DD dates
- BRICK, DEDEMAN: Standard TVA, e-factura support
- KINETERRA, BEST PRINT: Non-VAT payers (returns [])
- STEPOUT MARKET: TVA 5% (books/reduced rate)
- UNLIMITED KEYS: NUMERAR payment detection
- GAMA INK, ELECTROBERING, PICTUS VELUM: Standard TVA
## Flexible TVA Patterns
- All patterns use (\d{1,2})% to accept any rate
- Supports historical (19%, 9%, 5%) and current (21%, 11%)
## Payment Methods Fix
- Fixed base.py to support multiple payments of same type
- Changed deduplication from method-only to (method, amount) tuple
- Returns separate entries for split payments
## Tools
- Add generate_store_profile.py for automatic profile generation
- Analyzes PDFs via OCR API and detects patterns
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
258
backend/modules/data_entry/services/ocr/profiles/README.md
Normal file
@@ -0,0 +1,258 @@
# Store Profiles - OCR Extraction

Store-specific profile system for OCR extraction, with hot-reload support.

---

## Quick Start: Add a new profile

```bash
# 1. Generate a profile from PDFs (dry run to preview)
python scripts/generate_store_profile.py \
  --name "Magazin Nou SRL" \
  --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" \
  --dry-run

# 2. Generate and save
python scripts/generate_store_profile.py \
  --name "Magazin Nou SRL" \
  --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" \
  --output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py

# 3. Hot-reload (no server restart)
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload

# 4. Verify
curl http://localhost:8000/api/data-entry/ocr/profiles
```

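Step 3 works because Python can re-execute an already imported module in place. The sketch below is a minimal, self-contained illustration of that mechanism with `importlib.reload()`. The module name `demo_profile`, the helper `load_or_reload`, and the temporary directory are hypothetical stand-ins for the real profile modules, not part of this codebase.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_or_reload(module_name: str):
    """Import the module on first use, re-execute it on later calls."""
    if module_name in sys.modules:
        return importlib.reload(sys.modules[module_name])
    return importlib.import_module(module_name)


def demo() -> tuple:
    """Edit a throwaway module on disk and observe the change after reload."""
    # Skip .pyc files so a quick rewrite is never served from stale bytecode.
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True
    with tempfile.TemporaryDirectory() as tmp:
        sys.path.insert(0, tmp)
        try:
            source = Path(tmp) / "demo_profile.py"  # hypothetical profile module
            source.write_text('STORE_NAME = "Magazin Nou SRL"\n')
            importlib.invalidate_caches()
            before = load_or_reload("demo_profile").STORE_NAME

            # Simulate editing the profile file while the process keeps running.
            source.write_text('STORE_NAME = "Magazin Nou SRL (edited)"\n')
            after = load_or_reload("demo_profile").STORE_NAME
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_profile", None)
            sys.dont_write_bytecode = old_flag
    return before, after


if __name__ == "__main__":
    print(demo())
```

Objects created from the old module code are not updated by a reload, which is presumably why the registry clears its instance cache first.
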
---

## Directory structure

```
profiles/
├── __init__.py          # ProfileRegistry + hot-reload (~390 lines)
├── base.py              # BaseStoreProfile + generic patterns (~515 lines)
├── lidl.py              # Multi-rate TVA (A/B/C/D)
├── omv.py               # B2B, YYYY.MM.DD dates
├── socar.py             # B2B, YYYY.MM.DD dates
├── brick.py             # Standard TVA
├── dedeman.py           # E-factura support
├── kineterra.py         # Non-VAT payer
├── gama_ink.py          # Standard TVA (toner/cartridges)
├── electrobering.py     # Standard TVA (electronics)
├── pictus_velum.py      # Standard TVA (stationery)
├── unlimited_keys.py    # Standard TVA, NUMERAR payment
├── best_print.py        # Non-VAT payer
├── stepout_market.py    # TVA 5% (books/bookstore)
└── README.md            # This file
```

---

## Existing profiles (12)

> **Note**: The TVA patterns are **flexible** and accept ANY rate (5%, 9%, 11%, 19%, 21%, etc.)
> so that both historical data and future legislative changes are handled.

| Store | CUI | File | Features |
|-------|-----|------|----------|
| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (codes A, B, C, D) |
| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), YYYY.MM.DD dates |
| SOCAR PETROLEUM S.A. | 12546600 | `socar.py` | B2B (client CUI), YYYY.MM.DD dates |
| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA |
| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support |
| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returns `[]`) |
| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartridges) |
| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronics) |
| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (stationery) |
| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** (cash) payment |
| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** |
| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (books, bookstore) |

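To make the "any rate" idea concrete, here is a minimal sketch of a rate-agnostic TVA matcher built around `(\d{1,2})%`. The line layout `TVA <code> <rate>% <amount>` and the names `TVA_LINE` and `extract_tva_entries_demo` are assumptions chosen for illustration; the real per-store patterns live in the profile files and may look different.

```python
import re
from decimal import Decimal
from typing import List

# Assumed receipt layout (hypothetical): "TVA A 19% 3,80"
# (\d{1,2})% accepts any one- or two-digit rate instead of a hard-coded list.
TVA_LINE = re.compile(r'TVA\s+([A-D])\s+(\d{1,2})%\s+(\d+[.,]\d{2})')


def extract_tva_entries_demo(text: str) -> List[dict]:
    """Return one dict per distinct (code, percent, amount) found in the text."""
    entries = []
    seen = set()  # avoids emitting the same line twice
    for code, percent, amount in TVA_LINE.findall(text.upper()):
        key = (code, percent, amount)
        if key in seen:
            continue
        seen.add(key)
        entries.append({
            "code": code,
            "percent": int(percent),
            "amount": Decimal(amount.replace(",", ".")),
        })
    return entries


if __name__ == "__main__":
    sample = "TVA A 19% 3,80\nTVA B 9% 1,25\nTVA A 21% 4,20"
    for entry in extract_tva_entries_demo(sample):
        print(entry)
```

Because the rate is captured rather than enumerated, the same pattern serves receipts printed under older and newer rates.
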
---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/data-entry/ocr/profiles` | GET | List all profiles |
| `/api/data-entry/ocr/profiles/{cui}` | GET | Profile details (accepts the RO prefix) |
| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload all profiles |

### API examples

```bash
# List profiles
curl http://localhost:8000/api/data-entry/ocr/profiles \
  -H "Authorization: Bearer <token>"

# Profile details (with or without the RO prefix)
curl http://localhost:8000/api/data-entry/ocr/profiles/22891860
curl http://localhost:8000/api/data-entry/ocr/profiles/RO22891860

# Hot-reload after changes
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
  -H "Authorization: Bearer <token>"

# Reload response:
{
  "success": true,
  "reloaded_modules": 12,
  "profiles_count": 12,
  "registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...],
  "last_reload": "2026-01-06T22:37:05.000000"
}
```

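The `{cui}` endpoint treats `22891860` and `RO22891860` as the same key. A minimal sketch of how such a lookup can be made prefix-insensitive follows; it mirrors the normalization rules described for the registry (strip whitespace, uppercase, drop a leading `RO`), while the `DEMO_PROFILES` table and the function names are hypothetical.

```python
from typing import Optional


def normalize_cui(cui: Optional[str]) -> str:
    """Strip whitespace, uppercase, and drop a leading 'RO' (with or without space)."""
    if not cui:
        return ""
    cui = str(cui).strip().upper()
    if cui.startswith("RO"):
        cui = cui[2:].lstrip()
    return cui.strip()


# Hypothetical lookup table keyed by normalized CUI.
DEMO_PROFILES = {"22891860": "LidlProfile"}


def find_profile_name(cui: Optional[str]) -> Optional[str]:
    """Resolve any accepted spelling of a CUI to the registered profile name."""
    return DEMO_PROFILES.get(normalize_cui(cui))


if __name__ == "__main__":
    for raw in ("22891860", "RO22891860", " ro 22891860 ", None):
        print(repr(raw), "->", find_profile_name(raw))
```

Normalizing once at both registration and lookup time keeps the dictionary keys canonical, so no caller has to remember which spelling was stored.
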
---

## How the system works

### Extraction flow

```
ReceiptExtractor.extract()
    │
    ├─► STEP 1: Extract vendor + CUI
    │       └─► _extract_vendor(), _extract_cui()
    │
    ├─► ProfileRegistry.get_profile(cui)
    │       └─► Returns a specific profile or None
    │
    ├─► STEP 2: Extraction with profile (if one exists)
    │       ├─► profile.extract_total()
    │       ├─► profile.extract_date()
    │       ├─► profile.extract_receipt_number()
    │       ├─► profile.extract_tva_entries()
    │       ├─► profile.extract_payment_methods()
    │       └─► profile.extract_client_cui()
    │
    └─► STEP 3-4: Validation + post-processing
```

### Fallback

If no profile exists for the CUI, the generic logic in `ReceiptExtractor` is used.

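The dispatch between a store profile and the generic path can be sketched in a few lines. This is only an illustration of the flow above under simplifying assumptions: `DemoProfile`, `DEMO_REGISTRY`, `generic_extract_total`, and the confidence values are hypothetical, and the real `ReceiptExtractor` is not shown in this commit view.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

Result = Tuple[Optional[Decimal], float]


def _to_decimal(raw: str) -> Decimal:
    """Convert '45,90' or '45.90' to a Decimal."""
    return Decimal(raw.replace(",", "."))


class DemoProfile:
    """Hypothetical profile for a store that prints 'TOTAL LEI <amount>'."""

    def extract_total(self, text: str) -> Result:
        match = re.search(r'TOTAL\s+LEI\s*(\d+[.,]\d{2})', text.upper())
        return (_to_decimal(match.group(1)), 0.98) if match else (None, 0.0)


# Hypothetical registry: normalized CUI -> profile instance.
DEMO_REGISTRY = {"12345678": DemoProfile()}


def generic_extract_total(text: str) -> Result:
    """Stand-in for the generic logic used when no profile is registered."""
    match = re.search(r'TOTAL\s*:?\s*(\d+[.,]\d{2})', text.upper())
    return (_to_decimal(match.group(1)), 0.80) if match else (None, 0.0)


def extract_total(text: str, cui: Optional[str]) -> Tuple[Optional[Decimal], float, str]:
    """Use the store profile when one exists, otherwise fall back to generic logic."""
    profile = DEMO_REGISTRY.get(cui) if cui else None
    if profile is not None:
        amount, confidence = profile.extract_total(text)
        return amount, confidence, "profile"
    amount, confidence = generic_extract_total(text)
    return amount, confidence, "generic"


if __name__ == "__main__":
    print(extract_total("TOTAL LEI 45,90", "12345678"))
    print(extract_total("Total: 10.00", "99999999"))
```

Returning the source label alongside the value is a choice made for this sketch; it makes it easy to see in logs which path produced a result.
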
---

## Profile structure

```python
from typing import Any, Dict, List

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class MagazinNouProfile(BaseStoreProfile):
    """Docstring describing the store."""

    CUI_LIST = ["12345678"]  # May list several CUIs
    NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"]  # OCR variants
    STORE_NAME = "Magazin Nou SRL"

    # Override only what differs from the base class
    def extract_tva_entries(self, text: str) -> List[dict]:
        # Store-specific patterns
        ...

    def get_validation_hints(self) -> Dict[str, Any]:
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
```

---

## Patterns available in base.py

`BaseStoreProfile` includes generic, OCR-tolerant patterns:

| Pattern | Description |
|---------|-------------|
| `TOTAL_PATTERNS` | 8 variants for TOTAL (TOTAL:, TOTAL DE PLATA, etc.) |
| `DATE_PATTERNS` | 6 variants (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) |
| `DATE_PATTERNS_OCR_SPACES` | 4 variants with OCR-introduced spaces ("2025. 08. 14") |
| `NUMBER_PATTERNS` | 11 variants for the receipt number (NDS, BF, C3POS) |
| `PAYMENT_PATTERNS` | 8 variants for CARD/NUMERAR |
| `CLIENT_MARKERS` | 6 variants for the CLIENT section |
| `CLIENT_CUI_PATTERNS` | 7 variants for the client CUI |

### Methods implemented in the base class

Each tuple is `(value, confidence)`; when nothing matches, the value is `None` and the confidence is `0.0`.

- `extract_total(text)` → `Tuple[Optional[Decimal], float]`
- `extract_date(text)` → `Tuple[Optional[date], float]`
- `extract_receipt_number(text)` → `Tuple[Optional[str], float]`
- `extract_payment_methods(text)` → `List[dict]`
- `extract_client_cui(text)` → `Tuple[str, float]`
- `extract_client_name(text)` → `Tuple[str, float]`

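The pattern lists are ordered from most to least specific, and the first pattern that yields a usable value determines the confidence. Below is a minimal sketch of that first-match strategy. The three-entry `DEMO_TOTAL_PATTERNS` list is an illustrative subset, and `parse_amount` is a hypothetical helper written for this example (the base class's own parsing helper is not shown in this view).

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional, Tuple

# Illustrative subset: (regex, confidence), most specific first.
DEMO_TOTAL_PATTERNS = [
    (r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
    (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
    (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
]

MAX_AMOUNT = Decimal("100000")  # sanity bound to reject OCR garbage


def parse_amount(raw: str) -> Optional[Decimal]:
    """Hypothetical helper: drop whitespace, accept ',' or '.' as decimal mark."""
    cleaned = re.sub(r'\s+', '', raw).replace(',', '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_total_demo(text: str) -> Tuple[Optional[Decimal], float]:
    """Return (amount, confidence) from the first pattern that parses cleanly."""
    upper = text.upper()
    for pattern, confidence in DEMO_TOTAL_PATTERNS:
        match = re.search(pattern, upper)
        if not match:
            continue
        amount = parse_amount(match.group(1))
        if amount is not None and 0 < amount < MAX_AMOUNT:
            return amount, confidence
    return None, 0.0


if __name__ == "__main__":
    print(extract_total_demo("TOTAL LEI 45,90"))
    print(extract_total_demo("Suma: 12,50"))
    print(extract_total_demo("no amount here"))
```

A subclass that only needs different wording can replace the pattern list and inherit the matching loop unchanged.
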
---

## When do you need a custom profile?

| Situation | Example | What to override |
|-----------|---------|------------------|
| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` |
| **Special date format** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` |
| **B2B receipts** | Gas stations (include a client CUI) | `extract_client_cui()` |
| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returns `[]` |
| **E-factura** | Dedeman | `extract_efactura_reference()` |

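As an illustration of the "special date format" row, the sketch below parses dates in which OCR has inserted spaces after the separators, such as `2025. 08. 14`. The `(pattern, confidence, format)` tuples follow the shape of `DATE_PATTERNS_OCR_SPACES`; the function name `parse_spaced_date` and the reduced two-pattern list are assumptions for this example.

```python
import re
from datetime import date
from typing import Optional, Tuple

# (regex, confidence, field order); spaces after '.' or ',' are tolerated.
DEMO_SPACED_PATTERNS = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def parse_spaced_date(text: str) -> Tuple[Optional[date], float]:
    """Return (date, confidence) for the first pattern that gives a valid date."""
    for pattern, confidence, order in DEMO_SPACED_PATTERNS:
        match = re.search(pattern, text)
        if not match:
            continue
        first, second, third = (int(group) for group in match.groups())
        if order == 'ymd':
            year, month, day = first, second, third
        else:  # 'dmy'
            day, month, year = first, second, third
        try:
            return date(year, month, day), confidence
        except ValueError:
            # Impossible calendar date (e.g. month 13): try the next pattern.
            continue
    return None, 0.0


if __name__ == "__main__":
    print(parse_spaced_date("2025. 08. 14 10:32"))
    print(parse_spaced_date("14. 08. 2025"))
    print(parse_spaced_date("2025. 13. 40"))
```

Letting `date()` reject impossible values keeps the validation in one place instead of duplicating range checks per format.
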
---

## Design decisions

1. **Manual hot-reload** - the `/profiles/reload` endpoint is called whenever profile files change
2. **Persistence in Python** - profiles live in Git and are version controlled
3. **Graceful fallback** - if no profile exists, the generic logic is used
4. **CUI normalization** - the "RO" prefix and whitespace are handled automatically
5. **TVA deduplication** - a `seen = set()` is used to avoid duplicates

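The same `seen` set idea applies to payment methods, where this commit keys deduplication on the `(method, amount)` pair so that split payments of the same type survive. A minimal sketch under assumptions: the two overlapping CARD patterns and the name `extract_payments_demo` are made up for the example and are simpler than the real `PAYMENT_PATTERNS`.

```python
import re
from decimal import Decimal
from typing import List

# Simplified, deliberately overlapping patterns: (regex, method, confidence).
DEMO_PAYMENT_PATTERNS = [
    (r'CARD\s*:?\s*(\d+[.,]\d{2})', 'CARD', 0.95),
    (r'(?:PLATA\s+)?CARD\s*:?\s*(\d+[.,]\d{2})', 'CARD', 0.90),
    (r'NUMERAR\s*:?\s*(\d+[.,]\d{2})', 'NUMERAR', 0.95),
]


def extract_payments_demo(text: str) -> List[dict]:
    """Collect payments, skipping repeats of the same (method, amount) pair."""
    upper = text.upper()
    payments = []
    seen = set()  # (method, amount) pairs already emitted
    for pattern, method, confidence in DEMO_PAYMENT_PATTERNS:
        for match in re.finditer(pattern, upper):
            amount = Decimal(match.group(1).replace(',', '.'))
            key = (method, amount)
            if key in seen:
                continue  # same payment matched again by an overlapping pattern
            seen.add(key)
            payments.append({'method': method, 'amount': amount, 'confidence': confidence})
    return payments


if __name__ == "__main__":
    receipt = "CARD: 50,00\nCARD: 25,50\nNUMERAR 10,00"
    for payment in extract_payments_demo(receipt):
        print(payment)
```

Keying on the method alone would have kept only the first CARD line. One limitation of the pair key in this sketch is that two split payments with the same method and the same amount still collapse into one entry.
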
---

## Useful commands

```bash
# Check Python syntax for all profiles
for f in backend/modules/data_entry/services/ocr/profiles/*.py; do
  python3 -m py_compile "$f" && echo "✓ $(basename "$f")"
done

# List profiles
ls -la backend/modules/data_entry/services/ocr/profiles/

# Start the backend for testing
cd backend && source venv/bin/activate
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

# Test OCR on a PDF
curl -X POST -F "file=@docs/data-entry/test.pdf" \
  -H "Authorization: Bearer <token>" \
  "http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus"
```

---

## Profile generation script

`scripts/generate_store_profile.py` - automatic profile generator

```bash
# Show help
python scripts/generate_store_profile.py --help

# Features:
# - Analyzes PDFs via the OCR API
# - Detects: TVA format, date format, payment patterns, B2B
# - Generates Python code with OCR error variants
# - Supports glob patterns (*.pdf)
# - Checks syntax after generation
```
388
backend/modules/data_entry/services/ocr/profiles/__init__.py
Normal file
@@ -0,0 +1,388 @@
"""
Store Profiles Registry with Hot-Reload Support.

This module provides a registry for store-specific OCR extraction profiles.
Profiles can be reloaded at runtime without restarting the server.

Usage:
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    # Get profile for a CUI
    profile = ProfileRegistry.get_profile("22891860")
    if profile:
        tva_entries = profile.extract_tva_entries(text)

    # Reload all profiles (after file changes)
    count = ProfileRegistry.reload_all()

Architecture:
    - ProfileRegistry: Singleton registry with class methods
    - BaseStoreProfile: Abstract base class for profiles
    - @ProfileRegistry.register: Decorator for profile classes

Hot-Reload Mechanism:
    1. Admin calls POST /profiles/reload endpoint
    2. Registry clears instance cache
    3. importlib.reload() re-executes each profile module
    4. @register decorator re-registers classes with new code
"""

from __future__ import annotations

import importlib
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Type, TYPE_CHECKING

if TYPE_CHECKING:
    from .base import BaseStoreProfile

logger = logging.getLogger(__name__)

# Directory containing profile modules
PROFILES_DIR = Path(__file__).parent


class ProfileRegistry:
    """
    Registry for store-specific OCR extraction profiles.

    Uses class methods for singleton-like behavior without explicit instantiation.
    Supports hot-reload via importlib.reload() for runtime updates.

    Attributes:
        _profiles: Maps CUI -> profile class (not instance)
        _instances: Maps CUI -> profile instance (lazy, cleared on reload)
        _last_reload: Timestamp of last reload
        _loaded: Whether initial load has been performed
    """

    # Class-level storage (singleton pattern via class methods)
    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
    _instances: Dict[str, "BaseStoreProfile"] = {}
    _last_reload: Optional[datetime] = None
    _loaded: bool = False

    # -------------------------------------------------------------------------
    # Registration
    # -------------------------------------------------------------------------

    @classmethod
    def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]:
        """
        Decorator to register a store profile class.

        Registers the profile for all CUIs in the class's CUI_LIST.
        Safe for re-registration during hot-reload (overwrites existing).

        Usage:
            @ProfileRegistry.register
            class LidlProfile(BaseStoreProfile):
                CUI_LIST = ["22891860"]
                ...

        Args:
            profile_class: Profile class to register

        Returns:
            The same class (allows use as decorator)

        Note:
            A class with an empty CUI_LIST is not registered: a warning is
            logged and the class is returned unchanged.
        """
        cui_list = getattr(profile_class, 'CUI_LIST', [])
        store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__)

        if not cui_list:
            logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping")
            return profile_class

        # Register for each CUI
        for cui in cui_list:
            # Normalize CUI (remove RO prefix, strip whitespace)
            normalized_cui = cls._normalize_cui(cui)

            if normalized_cui in cls._profiles:
                old_class = cls._profiles[normalized_cui]
                logger.debug(
                    f"Re-registering CUI {normalized_cui}: "
                    f"{old_class.__name__} -> {profile_class.__name__}"
                )
                # Clear cached instance for this CUI
                cls._instances.pop(normalized_cui, None)

            cls._profiles[normalized_cui] = profile_class
            logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}")

        logger.info(f"Registered {store_name} for CUIs: {cui_list}")
        return profile_class

    # -------------------------------------------------------------------------
    # Lookup
    # -------------------------------------------------------------------------

    @classmethod
    def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]:
        """
        Get profile instance for a CUI.

        Uses lazy instantiation - creates instance on first access.
        Returns None if no profile is registered for this CUI.

        Args:
            cui: CUI to lookup (with or without RO prefix)

        Returns:
            Profile instance or None
        """
        if not cui:
            return None

        # Ensure profiles are loaded
        if not cls._loaded:
            cls._load_all_profiles()

        normalized_cui = cls._normalize_cui(cui)

        # Check if profile exists
        profile_class = cls._profiles.get(normalized_cui)
        if not profile_class:
            return None

        # Lazy instantiation
        if normalized_cui not in cls._instances:
            try:
                cls._instances[normalized_cui] = profile_class()
                logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}")
            except Exception as e:
                logger.error(f"Failed to instantiate {profile_class.__name__}: {e}")
                return None

        return cls._instances[normalized_cui]

    @classmethod
    def has_profile(cls, cui: Optional[str]) -> bool:
        """Check if a profile exists for this CUI."""
        if not cui:
            return False
        if not cls._loaded:
            cls._load_all_profiles()
        return cls._normalize_cui(cui) in cls._profiles

    # -------------------------------------------------------------------------
    # Listing
    # -------------------------------------------------------------------------

    @classmethod
    def list_profiles(cls) -> List[Dict]:
        """
        List all registered profiles.

        Returns:
            List of dicts with cuis, class_name, store_name, name_patterns
        """
        if not cls._loaded:
            cls._load_all_profiles()

        result = []
        seen_classes = set()

        for cui, profile_class in cls._profiles.items():
            # Avoid duplicates for profiles with multiple CUIs
            if profile_class.__name__ in seen_classes:
                continue
            seen_classes.add(profile_class.__name__)

            result.append({
                "cuis": list(getattr(profile_class, 'CUI_LIST', [])),
                "class_name": profile_class.__name__,
                "store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__),
                "name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])),
            })

        return result

    @classmethod
    def get_profile_info(cls, cui: str) -> Optional[Dict]:
        """
        Get detailed info about a profile.

        Args:
            cui: CUI to lookup

        Returns:
            Dict with profile details or None
        """
        profile = cls.get_profile(cui)
        if not profile:
            return None

        return {
            "cui": cui,
            "cuis": list(profile.CUI_LIST),
            "class_name": profile.__class__.__name__,
            "store_name": profile.STORE_NAME,
            "name_patterns": list(profile.NAME_PATTERNS),
            "validation_hints": profile.get_validation_hints(),
        }

    # -------------------------------------------------------------------------
    # Hot-Reload
    # -------------------------------------------------------------------------

    @classmethod
    def reload_all(cls) -> int:
        """
        Hot-reload all profile modules.

        Clears instance cache and reloads all .py files in profiles directory.
        Decorator re-registers classes with updated code.

        Returns:
            Number of modules reloaded
        """
        logger.info("Starting profile hot-reload...")

        # Clear instance cache (will be recreated on next get_profile)
        cls._instances.clear()

        # Get list of profile modules (exclude __init__, base)
        module_names = cls._get_profile_module_names()

        count = 0
        for module_name in module_names:
            full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"

            try:
                if full_name in sys.modules:
                    # Reload existing module
                    importlib.reload(sys.modules[full_name])
                    logger.debug(f"Reloaded module: {module_name}")
                else:
                    # Import new module
                    importlib.import_module(full_name)
                    logger.debug(f"Imported new module: {module_name}")
                count += 1
            except Exception as e:
                logger.error(f"Failed to reload {module_name}: {e}")

        cls._last_reload = datetime.utcnow()
        cls._loaded = True

        logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles")
        return count

    @classmethod
    def get_reload_status(cls) -> Dict:
        """Get status of the registry including last reload time."""
        return {
            "loaded": cls._loaded,
            "last_reload": cls._last_reload.isoformat() if cls._last_reload else None,
            "profiles_count": len(cls._profiles),
            "instances_count": len(cls._instances),
            "registered_cuis": list(cls._profiles.keys()),
        }

    # -------------------------------------------------------------------------
    # Internal methods
    # -------------------------------------------------------------------------

    @classmethod
    def _normalize_cui(cls, cui: str) -> str:
        """
        Normalize CUI for consistent lookup.

        - Removes RO prefix (with or without space)
        - Strips whitespace
        - Converts to uppercase

        Args:
            cui: Raw CUI string

        Returns:
            Normalized CUI without the RO prefix (digits only for a
            well-formed CUI; other characters are not filtered out)
        """
        if not cui:
            return ""

        cui = str(cui).strip().upper()

        # Remove RO prefix (handles "RO12345" and "RO 12345")
        if cui.startswith("RO"):
            cui = cui[2:].lstrip()

        return cui.strip()

    @classmethod
    def _get_profile_module_names(cls) -> List[str]:
        """
        Get list of profile module names from profiles directory.

        Excludes __init__.py and base.py.

        Returns:
            List of module names (without .py extension)
        """
        excluded = {"__init__", "base", "__pycache__"}
        modules = []

        for path in PROFILES_DIR.glob("*.py"):
            name = path.stem
            if name not in excluded:
                modules.append(name)

        return sorted(modules)

    @classmethod
    def _load_all_profiles(cls) -> None:
        """
        Initial load of all profile modules.

        Called automatically on first get_profile() if not already loaded.
        """
        if cls._loaded:
            return

        logger.info("Loading store profiles...")

        module_names = cls._get_profile_module_names()

        for module_name in module_names:
            full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
            try:
                importlib.import_module(full_name)
                logger.debug(f"Loaded module: {module_name}")
            except Exception as e:
                logger.error(f"Failed to load {module_name}: {e}")

        cls._loaded = True
        cls._last_reload = datetime.utcnow()

        logger.info(f"Loaded {len(cls._profiles)} store profiles")

    @classmethod
    def clear(cls) -> None:
        """
        Clear all registered profiles.

        Mainly useful for testing.
        """
        cls._profiles.clear()
        cls._instances.clear()
        cls._loaded = False
        cls._last_reload = None


# -------------------------------------------------------------------------
# Module exports
# -------------------------------------------------------------------------

__all__ = [
    "ProfileRegistry",
    "BaseStoreProfile",
]

# Re-export BaseStoreProfile for convenience
from .base import BaseStoreProfile
515
backend/modules/data_entry/services/ocr/profiles/base.py
Normal file
@@ -0,0 +1,515 @@
"""
Base class for store-specific OCR extraction profiles.

Each store can have different receipt formats (TVA layout, total position, etc.).
Store profiles allow customizing extraction logic per-store for better accuracy.

Usage:
    from .base import BaseStoreProfile
    from . import ProfileRegistry

    @ProfileRegistry.register
    class LidlProfile(BaseStoreProfile):
        CUI_LIST = ["22891860"]
        NAME_PATTERNS = ["LIDL", "LDL"]

        def extract_tva_entries(self, text: str) -> List[dict]:
            # Custom Lidl TVA extraction logic
            ...
"""

import re
from abc import ABC
from decimal import Decimal, InvalidOperation
from typing import List, Optional, Tuple, Dict, Any
from datetime import date


class BaseStoreProfile(ABC):
    """
    Abstract base class for store-specific extraction profiles.

    Each profile defines:
    - CUI_LIST: CUI codes that identify this store (without RO prefix)
    - NAME_PATTERNS: OCR-tolerant name patterns for fallback matching
    - Custom extraction methods for TVA, total, date, etc.

    The ProfileRegistry uses CUI_LIST to lookup profiles during extraction.
    """

    # -------------------------------------------------------------------------
    # Class attributes - override in subclasses
    # -------------------------------------------------------------------------

    # List of CUI codes (without RO prefix) that identify this store
    CUI_LIST: List[str] = []

    # OCR-tolerant name patterns for fallback matching
    NAME_PATTERNS: List[str] = []

    # Store display name
    STORE_NAME: str = "Unknown Store"

    # -------------------------------------------------------------------------
    # Generic patterns - can be overridden in subclasses
    # -------------------------------------------------------------------------

    # Total amount patterns (confidence-weighted)
    TOTAL_PATTERNS = [
        (r'T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98),
        (r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
        (r'[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95),
        (r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
        (r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
        (r'SUBTOTAL\s*([\d\s.,]+)', 0.90),
        (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
        (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
    ]

    # Date patterns (confidence-weighted)
    DATE_PATTERNS = [
        (r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
        (r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80),
        (r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75),
    ]

    # Date patterns with OCR-introduced spaces (separate because format is different)
    DATE_PATTERNS_OCR_SPACES = [
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'),
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
    ]

    # Receipt number patterns (confidence-weighted)
    NUMBER_PATTERNS = [
        (r'NDS\s*:?\s*(\d+)', 0.98),
        (r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98),
        (r'C3POS.*?(\d{6,7})\b', 0.95),
        (r'BF\s*:\s*(\d{4,})', 0.96),
        (r'BF\s+(\d{4,})', 0.93),
        (r'NIVS\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90),
        (r'ID\s*BF\s*:?\s*(\d+)', 0.90),
    ]

    # Payment method patterns (pattern, method_type, confidence)
    PAYMENT_PATTERNS = [
        (r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98),
        (r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97),
        (r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95),
        (r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95),
        (r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90),
        (r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70),
        (r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75),
        (r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70),
    ]

    # Client section markers (for B2B receipts)
    CLIENT_MARKERS = [
        r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:',
        r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:',
        r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:',
        r'CLIENT\s*:',
        r'CUMPARATOR\s*:',
        r'BENEFICIAR\s*:',
    ]

    # Client CUI patterns (pattern, confidence)
    CLIENT_CUI_PATTERNS = [
        (r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99),
        (r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98),
        (r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98),
        (r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
        (r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
        (r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95),
        (r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90),
    ]

    # Company type indicators (for identifying company names)
    COMPANY_INDICATORS = [
        r'\bS\.?\s*R\.?\s*L\.?\b',       # S.R.L. or S. R. L.
        r'\bS\.?\s*A\.?\b',              # S.A. or S. A.
        r'\bS\.?\s*N\.?\s*C\.?\b',       # S.N.C. or S. N. C.
        r'\bS\.?\s*C\.?\s*S\.?\b',       # S.C.S. or S. C. S.
        r'\bI\.?\s*I\.?\b',              # I.I. or I. I.
        r'\bP\.?\s*F\.?\s*A\.?\b',       # P.F.A. or P. F. A.
        r'\bS\.?\s*C\.?\s+[A-Z]',        # S.C. followed by company name
        r'HOLDING',
        r'COMPANY',
        r'GROUP',
    ]

    # Maximum reasonable payment amount (to filter OCR errors)
    MAX_PAYMENT = Decimal('100000')

    # -------------------------------------------------------------------------
    # Extraction methods - override in subclasses as needed
    # -------------------------------------------------------------------------

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Override this method in subclasses to handle store-specific TVA formats.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of dicts with keys: code, percent, amount
        """
        return []

    def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]:
        """
        Extract total amount from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (amount, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text_upper)
            if match:
                amount = self._parse_decimal(match.group(1))
                if amount and amount > 0 and amount < self.MAX_PAYMENT:
                    return (amount, confidence)

        return (None, 0.0)

    def extract_date(self, text: str) -> Tuple[Optional[date], float]:
        """
        Extract receipt date from text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (date, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        # Try standard patterns first
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text_upper)
            if match:
                parsed = self._parse_date(match.group(1))
                if parsed:
                    return (parsed, confidence)

        # Try OCR-corrupted patterns with spaces
        for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES:
            match = re.search(pattern, text_upper)
            if match:
                try:
                    if fmt == 'ymd':
                        year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
                    else:  # dmy
                        day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))

                    if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
                        return (date(year, month, day), confidence)
                except (ValueError, TypeError):
                    continue

        return (None, 0.0)

def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract receipt number from text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (number, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
|
||||
for pattern, confidence in self.NUMBER_PATTERNS:
|
||||
match = re.search(pattern, text_upper)
|
||||
if match:
|
||||
number = match.group(1).strip()
|
||||
if number and len(number) >= 3:
|
||||
return (number, confidence)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
    def extract_payment_methods(self, text: str) -> List[dict]:
        """
        Extract payment methods (CARD/NUMERAR) from receipt.

        Supports multiple payments of the same type (e.g., 2x CARD for split payments).
        Each payment is returned as a separate entry with its amount.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}]
            Multiple entries of same method type are allowed for split payments.
        """
        text_upper = text.upper()
        methods = []
        # Track (method, amount) pairs to avoid exact duplicates from overlapping patterns
        seen_entries = set()

        for pattern, method, confidence in self.PAYMENT_PATTERNS:
            for match in re.finditer(pattern, text_upper):
                try:
                    amount = self._parse_decimal(match.group(1))
                    if amount and 0 < amount < self.MAX_PAYMENT:
                        # Deduplicate by (method, amount) to avoid same entry from multiple patterns
                        # But allow different amounts for same method (split payments)
                        entry_key = (method, amount)
                        if entry_key not in seen_entries:
                            methods.append({
                                'method': method,
                                'amount': amount,
                                'confidence': confidence
                            })
                            seen_entries.add(entry_key)
                except (ValueError, InvalidOperation):
                    continue

        return methods
|
||||
|
||||
    def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]:
        """
        Extract client CUI from B2B receipts.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (cui, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        # First check if there's a CLIENT section
        has_client_section = any(
            re.search(marker, text_upper, re.IGNORECASE)
            for marker in self.CLIENT_MARKERS
        )

        if not has_client_section:
            return (None, 0.0)

        # Try to extract CUI
        for pattern, confidence in self.CLIENT_CUI_PATTERNS:
            match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE)
            if match:
                cui = match.group(1)
                # Normalize for storage: keep digits only, which drops the RO prefix
                cui_digits = re.sub(r'[^0-9]', '', cui)
                if 6 <= len(cui_digits) <= 10:
                    return (cui_digits, confidence)

        return (None, 0.0)
|
||||
|
||||
def extract_client_name(self, text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract client/buyer company name from B2B receipts.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (client_name, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
lines = text.split('\n')
|
||||
|
||||
# First check if there's a CLIENT section
|
||||
client_section_idx = None
|
||||
for i, line in enumerate(lines):
|
||||
line_upper = line.upper().strip()
|
||||
if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS):
|
||||
client_section_idx = i
|
||||
break
|
||||
|
||||
if client_section_idx is None:
|
||||
return (None, 0.0)
|
||||
|
||||
# Look for company name in CLIENT section
|
||||
line = lines[client_section_idx].strip()
|
||||
line_upper = line.upper()
|
||||
|
||||
# Strategy 1: Check if name is on same line after ":"
|
||||
if ':' in line:
|
||||
name_part = line.split(':', 1)[1].strip()
|
||||
if name_part and len(name_part) >= 3:
|
||||
# Skip if it looks like a CUI: an R, RO, or R0 prefix (OCR may misread O as 0) followed by 6-10 digits
|
||||
if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()):
|
||||
pass # This is CUI, not name - continue to next strategy
|
||||
else:
|
||||
# Check for company indicators
|
||||
name_upper = name_part.upper()
|
||||
if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(name_part), 0.95)
|
||||
elif len(name_part) >= 5 and not name_part.isdigit():
|
||||
return (self._clean_company_name(name_part), 0.80)
|
||||
|
||||
# Strategy 2: Check next line for company name
|
||||
if client_section_idx + 1 < len(lines):
|
||||
next_line = lines[client_section_idx + 1].strip()
|
||||
next_upper = next_line.upper()
|
||||
|
||||
# Skip if it's a CUI/CIF line or looks like CUI
|
||||
if not re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', next_upper):
|
||||
if not re.match(r'^R[O0]?\d{6,10}$', next_upper):
|
||||
if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(next_line), 0.90)
|
||||
elif len(next_line) >= 5 and not next_line.isdigit():
|
||||
# Check it's not CUI/CIF/COD keywords
|
||||
if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']):
|
||||
return (self._clean_company_name(next_line), 0.75)
|
||||
|
||||
# Strategy 3: Look for any line with company indicators in CLIENT section region
|
||||
search_end = min(client_section_idx + 5, len(lines))
|
||||
for i in range(client_section_idx + 1, search_end):
|
||||
line = lines[i].strip()
|
||||
line_upper = line.upper()
|
||||
|
||||
# Skip CUI/CIF lines
|
||||
if re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', line_upper):
|
||||
continue
|
||||
if re.match(r'^R[O0]?\d{6,10}$', line_upper):
|
||||
continue
|
||||
|
||||
if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(line), 0.85)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
@staticmethod
|
||||
def _clean_company_name(name: str) -> str:
|
||||
"""Clean company name for storage."""
|
||||
if not name:
|
||||
return ""
|
||||
# Remove extra whitespace
|
||||
name = re.sub(r'\s+', ' ', name).strip()
|
||||
# Remove trailing commas, semicolons, and colons; periods are kept for S.R.L., S.A., etc.
|
||||
name = re.sub(r'[,;:]+$', '', name).strip()
|
||||
return name
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Validation hints - override to customize validation behavior
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Return validation hints for this store.
|
||||
|
||||
Returns:
|
||||
Dict with validation hints. Common keys:
|
||||
- has_multi_rate_tva: bool - Store uses multiple TVA rates
|
||||
- card_equals_total: bool - CARD payment equals total
|
||||
- has_client_cui: bool - Receipt includes client CUI
|
||||
- has_efactura: bool - Store uses e-factura format
|
||||
- is_non_vat_payer: bool - Store is not a VAT payer
|
||||
"""
|
||||
return {}
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Helper methods - available to all subclasses
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
    @staticmethod
    def _normalize_number(text: str) -> str:
        """
        Normalize a number string for Decimal conversion.

        Handles Romanian formats: "1.234,56" -> "1234.56"

        A lone dot is treated as a decimal point, so "1.234" stays "1.234".
        """
        if not text:
            return "0"

        # Remove spaces
        text = text.replace(" ", "")

        # Determine decimal separator: whichever of "," and "." appears last
        last_comma = text.rfind(",")
        last_dot = text.rfind(".")

        if last_comma > last_dot:
            # Comma is the decimal separator; dots are thousands separators
            text = text.replace(".", "").replace(",", ".")
        elif last_dot > last_comma:
            # Dot is the decimal separator; commas are thousands separators
            text = text.replace(",", "")
        else:
            # Neither separator is present, so this replace is a no-op
            text = text.replace(",", ".")

        return text
|
||||
|
||||
    @staticmethod
    def _parse_decimal(text: str) -> Optional[Decimal]:
        """Parse a string to Decimal, handling various formats."""
        try:
            normalized = BaseStoreProfile._normalize_number(text)
            return Decimal(normalized)
        except (InvalidOperation, ValueError, TypeError):
            return None
|
||||
|
||||
    @staticmethod
    def _parse_date(text: str) -> Optional[date]:
        """
        Parse date string in various formats.

        Supports DD-MM-YYYY and YYYY-MM-DD, with "-", "/", or "." as separator.
        """
        if not text:
            return None

        # Normalize separators
        text = text.replace('/', '-').replace('.', '-')

        try:
            parts = text.split('-')
            if len(parts) != 3:
                return None

            # Determine format based on first part length
            if len(parts[0]) == 4:
                # YYYY-MM-DD
                year, month, day = int(parts[0]), int(parts[1]), int(parts[2])
            else:
                # DD-MM-YYYY
                day, month, year = int(parts[0]), int(parts[1]), int(parts[2])

            # Validate ranges; date() itself rejects impossible days such as 31 February
            if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
                return date(year, month, day)
        except (ValueError, TypeError, IndexError):
            pass

        return None
|
||||
|
||||
@staticmethod
|
||||
def _clean_text(text: str) -> str:
|
||||
"""Clean OCR text for pattern matching."""
|
||||
if not text:
|
||||
return ""
|
||||
text = re.sub(r'\s+', ' ', text)
|
||||
text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text)
|
||||
return text.strip()
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Magic methods
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"<{self.__class__.__name__} CUI={self.CUI_LIST}>"
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})"
|
||||
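The split-payment deduplication in `extract_payment_methods` can be sketched in isolation as follows. This is a minimal illustration, not the code from base.py: the `PAYMENT_PATTERNS` below are hypothetical stand-ins (the real ones are not shown in this part of the diff), and the decimal parsing is simplified to a comma-to-dot replacement.

```python
import re
from decimal import Decimal

# Hypothetical patterns for illustration only; the second one deliberately
# overlaps with the first so that the same receipt line is matched twice.
PAYMENT_PATTERNS = [
    (r'CARD\s*:?\s*(\d+[.,]\d{2})', 'CARD', 0.95),
    (r'(?:PLATA\s+)?CARD\s*:?\s*(\d+[.,]\d{2})', 'CARD', 0.90),
    (r'NUMERAR\s*:?\s*(\d+[.,]\d{2})', 'NUMERAR', 0.95),
]


def extract_payments(text):
    """Return one entry per distinct (method, amount) pair, in match order."""
    seen = set()
    payments = []
    for pattern, method, confidence in PAYMENT_PATTERNS:
        for match in re.finditer(pattern, text.upper()):
            # Simplified parsing: treat the comma as the decimal separator.
            amount = Decimal(match.group(1).replace(',', '.'))
            key = (method, amount)
            if key in seen:
                continue  # same entry already produced by an overlapping pattern
            seen.add(key)
            payments.append({'method': method, 'amount': amount, 'confidence': confidence})
    return payments


payments = extract_payments("CARD 50,00\nCARD 25,50\nNUMERAR 10,00")
for p in payments:
    print(p['method'], p['amount'])
```

Keying on `(method, amount)` keeps both CARD payments because their amounts differ, while the overlapping second pattern adds nothing. One consequence of this keying is that two split payments of the same method and the same amount would collapse into a single entry.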
@@ -0,0 +1,54 @@
|
||||
"""
BEST PRINT TRADE ACTIV SRL store profile for OCR extraction.

Stamp manufacturing service. Non-VAT payer (neplătitor de TVA).
"""

from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class BestPrintProfile(BaseStoreProfile):
    """
    BEST PRINT TRADE ACTIV SRL - non-VAT payer profile.

    Key characteristics:
    - Non-VAT payer (neplătitor de TVA) - NO TVA on receipts
    - Stamp manufacturing and printing services
    - Total amount has no TVA component
    - CARD payment typical
    """

    CUI_LIST = ["45417955"]
    NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"]
    STORE_NAME = "BEST PRINT TRADE ACTIV SRL"

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries - returns empty for non-VAT payer.

        BEST PRINT is a non-VAT payer (neplătitor de TVA),
        so no TVA entries are expected on receipts.

        Args:
            text: Raw OCR text from receipt (unused)

        Returns:
            Empty list (non-VAT payer has no TVA)
        """
        # Non-VAT payer - no TVA entries
        return []

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return BEST PRINT-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": True,  # May have client CUI
            "has_efactura": False,
            "is_non_vat_payer": True,  # CRITICAL: Non-VAT payer
            "tva_pattern": "none",
        }
|
||||
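How the validation hints are consumed is not shown in this part of the diff. As a rough sketch, a caller might use `is_non_vat_payer` to decide whether an empty TVA list is expected or suspicious. The function name `check_tva_consistency` and the warning texts are hypothetical, introduced here only for illustration.

```python
from typing import Any, Dict, List


def check_tva_consistency(hints: Dict[str, Any], tva_entries: List[dict]) -> List[str]:
    """Hypothetical check: compare extracted TVA entries against a profile's hints."""
    warnings = []
    if hints.get("is_non_vat_payer", False):
        # A non-VAT payer should produce no TVA entries at all.
        if tva_entries:
            warnings.append("TVA entries found for a non-VAT payer")
    elif not tva_entries:
        # A VAT payer with no TVA entries probably means extraction failed.
        warnings.append("no TVA entries found for a VAT payer")
    return warnings


# Hints shaped like the ones returned by the non-VAT payer profile.
non_vat_hints = {"is_non_vat_payer": True, "tva_pattern": "none"}
print(check_tva_consistency(non_vat_hints, []))
print(check_tva_consistency({"is_non_vat_payer": False}, []))
```

Under this assumption, the empty list returned by `extract_tva_entries` for a non-VAT payer is the expected result rather than an extraction failure.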
101
backend/modules/data_entry/services/ocr/profiles/brick.py
Normal file
@@ -0,0 +1,101 @@
|
||||
"""
|
||||
BRICK (Five-Holding) store profile for OCR extraction.
|
||||
|
||||
Five-Holding S.A. operates BRICK stores with standard receipt format.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class BrickProfile(BaseStoreProfile):
|
||||
"""
|
||||
FIVE-HOLDING S.A. (BRICK) - standard TVA format.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format
|
||||
- Single TVA rate typically
|
||||
- No client CUI on receipts
|
||||
"""
|
||||
|
||||
CUI_LIST = ["10562600"]
|
||||
NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"] # OCR variants
|
||||
STORE_NAME = "FIVE-HOLDING S.A."
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# Simple: "TVA XX% YY,YY"
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract BRICK-specific TVA entries.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return BRICK-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
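The coded-then-simple fallback used by the standard TVA profiles can be tried on its own. The sketch below reuses the first and third regexes from `TVA_PATTERNS`; the helper `to_decimal` is a simplified stand-in for `_parse_decimal` that assumes a comma decimal separator, and the sample strings are made up.

```python
import re
from decimal import Decimal

# First and third patterns from TVA_PATTERNS in the profile.
CODED = r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)'
SIMPLE = r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)'


def to_decimal(raw):
    # Simplified stand-in for _parse_decimal: comma is the decimal separator.
    return Decimal(raw.replace(',', '.'))


def extract_tva(text):
    """Prefer coded lines ("TVA A: 19% = ..."); fall back to "TVA 19% ..."."""
    entries = []
    seen = set()
    for match in re.finditer(CODED, text, re.IGNORECASE):
        key = (match.group(1).upper(), int(match.group(2)))
        if key not in seen:
            seen.add(key)
            entries.append({'code': key[0], 'percent': key[1],
                            'amount': to_decimal(match.group(3))})
    if not entries:
        match = re.search(SIMPLE, text, re.IGNORECASE)
        if match:
            # No code on the receipt line, so default to code 'A'.
            entries.append({'code': 'A', 'percent': int(match.group(1)),
                            'amount': to_decimal(match.group(2))})
    return entries


print(extract_tva("TVA A: 19% = 15,97"))
print(extract_tva("TVA 9% 4,50"))
```

The simple pattern only runs when no coded entry was found, so a receipt that prints both forms is not counted twice.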
118
backend/modules/data_entry/services/ocr/profiles/dedeman.py
Normal file
@@ -0,0 +1,118 @@
|
||||
"""
|
||||
DEDEMAN store profile for OCR extraction.
|
||||
|
||||
Dedeman receipts may include e-factura information and use standard TVA format.
|
||||
Large DIY retailer in Romania.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class DedemanProfile(BaseStoreProfile):
|
||||
"""
|
||||
DEDEMAN SRL - standard TVA with e-factura support.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format
|
||||
- May include e-factura reference number
|
||||
- Professional receipts for construction materials
|
||||
"""
|
||||
|
||||
CUI_LIST = ["2816464"]
|
||||
NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"] # OCR variants
|
||||
STORE_NAME = "DEDEMAN SRL"
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA (XX%) YY,YY"
|
||||
r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
# E-factura pattern for reference extraction
|
||||
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract Dedeman-specific TVA entries.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def extract_efactura_reference(self, text: str) -> str | None:
|
||||
"""
|
||||
Extract e-factura reference number if present.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
E-factura reference string or None
|
||||
"""
|
||||
match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE)
|
||||
return match.group(1) if match else None
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return Dedeman-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": True,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
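The behavior of `EFACTURA_PATTERN` can be checked with a few made-up strings. This sketch copies the regex from the profile; the reference values are invented sample text, not real e-factura numbers.

```python
import re

# Same regex as EFACTURA_PATTERN in the profile.
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'


def efactura_reference(text):
    """Return the token following "e-factura", or None when the label is absent."""
    match = re.search(EFACTURA_PATTERN, text, re.IGNORECASE)
    return match.group(1) if match else None


print(efactura_reference("e-Factura: ABC123"))
print(efactura_reference("efactura 99X"))
print(efactura_reference("Factura: 123"))
print(efactura_reference("e-factura emisa"))
```

Because `re.IGNORECASE` also applies to the `[A-Z0-9]` class, any word after the label is captured, so the last call returns `emisa`. A caller may want to check that the captured token contains at least one digit before treating it as a reference.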
@@ -0,0 +1,102 @@
|
||||
"""
|
||||
ELECTROBERING S.R.L. store profile for OCR extraction.
|
||||
|
||||
Electronics and home supplies store.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class ElectroberingProfile(BaseStoreProfile):
|
||||
"""
|
||||
ELECTROBERING S.R.L. - standard TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (single rate, any percentage)
|
||||
- Electronics and home supplies
|
||||
- May have client CUI for B2B purchases
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["2744937"]
|
||||
NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"]
|
||||
STORE_NAME = "ELECTROBERING S.R.L."
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA XX% YY,YY" (simple format without code)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return ELECTROBERING-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": True, # May have client CUI for B2B
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
103
backend/modules/data_entry/services/ocr/profiles/gama_ink.py
Normal file
@@ -0,0 +1,103 @@
|
||||
"""
|
||||
GAMA INK SERVICE SRL store profile for OCR extraction.
|
||||
|
||||
Toner refill and printer supplies store.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class GamaInkProfile(BaseStoreProfile):
|
||||
"""
|
||||
GAMA INK SERVICE SRL - standard TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (single rate, any percentage)
|
||||
- Service-based (toner refill, printer supplies)
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["17741882"]
|
||||
NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"]
|
||||
STORE_NAME = "GAMA INK SERVICE SRL"
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA XX% YY,YY" (simple format without code)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
# "TVA: YY,YY" (amount only, no percent); defined here but not used by extract_tva_entries
|
||||
r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first (have both code and percent)
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format (percent + amount without code)
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return GAMA INK-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
@@ -0,0 +1,53 @@
|
||||
"""
|
||||
KINETERRA store profile for OCR extraction.
|
||||
|
||||
Kineterra is a non-VAT payer (neplătitor de TVA).
|
||||
Receipts don't include TVA breakdown.
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class KineterraProfile(BaseStoreProfile):
|
||||
"""
|
||||
KINETERRA CONCEPT SRL - non-VAT payer profile.
|
||||
|
||||
Key characteristics:
|
||||
- Non-VAT payer (neplătitor de TVA)
|
||||
- No TVA breakdown on receipts
|
||||
- Total amount has no TVA component
|
||||
"""
|
||||
|
||||
CUI_LIST = ["31180432"]
|
||||
NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"] # OCR variants
|
||||
STORE_NAME = "KINETERRA CONCEPT SRL"
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries - returns empty for non-VAT payer.
|
||||
|
||||
Kineterra is a non-VAT payer, so no TVA entries are expected.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt (unused)
|
||||
|
||||
Returns:
|
||||
Empty list (non-VAT payer has no TVA)
|
||||
"""
|
||||
# Non-VAT payer - no TVA entries
|
||||
return []
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return Kineterra-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": True,
|
||||
"tva_pattern": "none",
|
||||
}
|
||||
93
backend/modules/data_entry/services/ocr/profiles/lidl.py
Normal file
@@ -0,0 +1,93 @@
|
||||
"""
LIDL store profile for OCR extraction.

Lidl receipts have a specific TVA format without hyphen/colon separators:
    TOTAL TVA 9,84
    TVA A 21,00% 7,71
    TVA B 11,00% 2,13

This profile handles multi-rate TVA extraction for Lidl receipts.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
    """
    LIDL DISCOUNT S.R.L. - multi-rate TVA profile.

    Key characteristics:
    - Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible)
    - TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line)
    - Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%)
    - CARD payment usually equals total
    - No client CUI on receipts
    """

    CUI_LIST = ["22891860"]
    NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"]  # OCR variants
    STORE_NAME = "LIDL DISCOUNT S.R.L."

    # Lidl-specific TVA patterns
    # Format: "TVA A 21,00% 7,71" (code + percent + amount on same line)
    TVA_PATTERNS = [
        # Primary: "TVA A 21,00% 7.71" with various spacing
        r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
        # With backslash OCR artifact: "TVA A \21,00% 7.71"
        r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
        # IVA variant (rare OCR misread)
        r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract Lidl-specific TVA entries.

        Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts.
        Uses deduplication to avoid counting the same entry twice from different patterns.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()  # Deduplication key: (code, percent)

        for pattern in self.TVA_PATTERNS:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return Lidl-specific validation hints."""
        return {
            "has_multi_rate_tva": True,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
|
||||
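The sample receipt lines from the Lidl module docstring make a convenient self-check: the per-rate amounts should add up to the TOTAL TVA line. The sketch below uses the second (superset) regex from `TVA_PATTERNS`; `TOTAL_TVA` and `to_decimal` are hypothetical helpers added only for this illustration.

```python
import re
from decimal import Decimal

# Second pattern from TVA_PATTERNS (tolerates a backslash OCR artifact).
LIDL_TVA = r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'
# Hypothetical pattern for the summary line; not part of the profile.
TOTAL_TVA = r'TOTAL\s+TVA\s+([\d.,]+)'

# Sample lines taken from the module docstring.
SAMPLE = """TOTAL TVA 9,84
TVA A 21,00% 7,71
TVA B 11,00% 2,13"""


def to_decimal(raw):
    # Simplified stand-in for _parse_decimal: comma is the decimal separator.
    return Decimal(raw.replace(',', '.'))


entries = [
    (m.group(1).upper(), int(m.group(2)), to_decimal(m.group(3)))
    for m in re.finditer(LIDL_TVA, SAMPLE, re.IGNORECASE)
]
declared_total = to_decimal(re.search(TOTAL_TVA, SAMPLE, re.IGNORECASE).group(1))
rates_sum = sum((amount for _, _, amount in entries), Decimal('0'))

print(entries)
print(rates_sum == declared_total)
```

The "TOTAL TVA" line is not picked up as a rate entry because the pattern requires a code letter A-D after "TVA". Comparing the sum of the rate amounts with the declared total is one way a validator could catch a misread digit.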
99
backend/modules/data_entry/services/ocr/profiles/omv.py
Normal file
@@ -0,0 +1,99 @@
"""
OMV Petrom store profile for OCR extraction.

OMV receipts typically include client CUI and use standard TVA format.
Common at gas stations with fuel purchases.

Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14")
"""

import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any, Tuple, Optional

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class OMVProfile(BaseStoreProfile):
    """
    OMV PETROM MARKETING S.R.L. - standard TVA with client CUI.

    Key characteristics:
    - Standard TVA format (usually single rate, any percentage)
    - Includes client CUI on receipt (for business purchases)
    - TVA table format: "A-XX,XX% base_amount tva_amount"
    - Supports historical rates (19%) and current rates (21%)
    - Date format: YYYY. MM. DD (with spaces)
    """

    CUI_LIST = ["11201891"]
    NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"]  # OCR variants
    STORE_NAME = "OMV PETROM MARKETING S.R.L."

    # OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva)
    TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'

    # Standard TVA pattern fallback (defined, but not used by extract_tva_entries below)
    TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)'

    # OMV specific: prioritize YYYY. MM. DD format with spaces
    DATE_PATTERNS_OCR_SPACES = [
        # YYYY. MM. DD with time (OMV format)
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
        # Fallback to DD. MM. YYYY
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract OMV-specific TVA entries.

        OMV receipts often show TVA in table format with base and TVA amounts.
        Returns an empty list if the table format is not found (no fallback is applied here).

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try table format first (more accurate)
        for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE):
            try:
                code = match.group(1).upper()
                percent = int(match.group(2))
                # TVA amount is the second number (group 4); group 3 is the base amount
                tva_amount = self._parse_decimal(match.group(4))

                if tva_amount and tva_amount > 0:
                    entry_key = (code, percent)
                    if entry_key not in seen:
                        entries.append({
                            'code': code,
                            'percent': percent,
                            'amount': tva_amount
                        })
                        seen.add(entry_key)
            except (ValueError, InvalidOperation):
                continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return OMV-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": True,
            "has_efactura": False,
            "is_non_vat_payer": False,
            "tva_table_format": True,
        }
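As a rough illustration of how the OMV table pattern behaves, the sketch below applies the same regular expression to a made-up receipt fragment. The `parse_decimal` helper is an assumption standing in for `BaseStoreProfile._parse_decimal`, which is not part of this diff excerpt; it only handles comma-decimal amounts. The sample text and the `extract_table_entries` name are likewise invented for the example.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List, Optional

# Same expression as OMVProfile.TVA_TABLE_PATTERN
TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Hypothetical stand-in for _parse_decimal: '1.285,66' -> Decimal('1285.66')."""
    cleaned = raw.strip().rstrip('.,')
    if ',' in cleaned:
        # Assumption: dots are thousands separators, the comma is the decimal mark
        cleaned = cleaned.replace('.', '').replace(',', '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_table_entries(text: str) -> List[dict]:
    """Collect one entry per (code, percent) pair found in table rows."""
    entries = []
    seen = set()
    for match in re.finditer(TVA_TABLE_PATTERN, text, re.IGNORECASE):
        code = match.group(1).upper()
        percent = int(match.group(2))
        # Group 3 is the taxable base; group 4 is the TVA amount
        tva_amount = parse_decimal(match.group(4))
        if tva_amount and tva_amount > 0 and (code, percent) not in seen:
            entries.append({'code': code, 'percent': percent, 'amount': tva_amount})
            seen.add((code, percent))
    return entries


sample = "TOTAL 335,24\nA-19,00% 285,66 49,58\nCARD 335,24"
print(extract_table_entries(sample))
```

A receipt that only prints a line such as `TVA: 49,58` produces no match with this pattern, which is why a separate fallback pattern is needed for that layout.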
101
backend/modules/data_entry/services/ocr/profiles/pictus_velum.py
Normal file
@@ -0,0 +1,101 @@
"""
PICTUS VELUM SRL store profile for OCR extraction.

Office supplies and stationery store.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class PictusVelumProfile(BaseStoreProfile):
    """
    PICTUS VELUM SRL - standard TVA profile.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Office supplies and stationery (rechizite)
    - CARD payment typical
    """

    CUI_LIST = ["39634534"]
    NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"]
    STORE_NAME = "PICTUS VELUM SRL"

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return PICTUS VELUM-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
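Every profile in this commit is decorated with `@ProfileRegistry.register` and declares a `CUI_LIST`. The registry itself is not shown in this part of the diff, so the following is only a minimal sketch of how such a decorator-based, CUI-keyed lookup could work. The `_by_cui` dictionary, the `get_by_cui` method, the stripping of an `RO` prefix, and the stub `BaseStoreProfile` are all assumptions made for illustration, not the project's actual API.

```python
from typing import Dict, List, Optional, Type


class BaseStoreProfile:
    """Hypothetical stub of the real base class (only the fields used here)."""
    CUI_LIST: List[str] = []
    STORE_NAME: str = ""


class ProfileRegistry:
    """Hypothetical registry: maps each CUI to the profile class that declares it."""
    _by_cui: Dict[str, Type[BaseStoreProfile]] = {}

    @classmethod
    def register(cls, profile_cls: Type[BaseStoreProfile]) -> Type[BaseStoreProfile]:
        for cui in profile_cls.CUI_LIST:
            cls._by_cui[cui] = profile_cls
        # Return the class unchanged so it stays usable after decoration
        return profile_cls

    @classmethod
    def get_by_cui(cls, cui: str) -> Optional[BaseStoreProfile]:
        normalized = cui.strip().upper()
        if normalized.startswith("RO"):
            # Assumption: receipts may print the CUI with the "RO" VAT prefix
            normalized = normalized[2:]
        profile_cls = cls._by_cui.get(normalized)
        return profile_cls() if profile_cls else None


@ProfileRegistry.register
class PictusVelumProfile(BaseStoreProfile):
    CUI_LIST = ["39634534"]
    STORE_NAME = "PICTUS VELUM SRL"


profile = ProfileRegistry.get_by_cui("RO39634534")
print(profile.STORE_NAME if profile else "no specific profile, use generic extraction")
```

Registering at class-definition time means that importing a profile module is enough to make it discoverable, which fits a hot-reload design where modules are re-imported on demand.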
111
backend/modules/data_entry/services/ocr/profiles/socar.py
Normal file
@@ -0,0 +1,111 @@
"""
SOCAR Petroleum store profile for OCR extraction.

SOCAR receipts are similar to OMV - gas station with client CUI support.
Date format may use YYYY. MM. DD with spaces.
"""

import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any, Tuple, Optional

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class SocarProfile(BaseStoreProfile):
    """
    SOCAR PETROLEUM S.A. - standard TVA with client CUI.

    Key characteristics:
    - Standard TVA format (usually single rate)
    - Includes client CUI on receipt (for business purchases)
    - Similar format to OMV/Petrom
    - Date format may use YYYY. MM. DD (with spaces)
    """

    CUI_LIST = ["12546600"]
    NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"]  # OCR variants
    STORE_NAME = "SOCAR PETROLEUM S.A."

    # Standard TVA patterns for gas stations
    TVA_PATTERNS = [
        # Table format: "A-19,00% 285,66 49,58"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)',
        # Simple format: "TVA 19% 49,58"
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    # Gas stations may use YYYY. MM. DD format
    DATE_PATTERNS_OCR_SPACES = [
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract SOCAR-specific TVA entries.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try table format first
        table_pattern = self.TVA_PATTERNS[0]
        for match in re.finditer(table_pattern, text, re.IGNORECASE):
            try:
                code = match.group(1).upper()
                percent = int(match.group(2))
                tva_amount = self._parse_decimal(match.group(4))

                if tva_amount and tva_amount > 0:
                    entry_key = (code, percent)
                    if entry_key not in seen:
                        entries.append({
                            'code': code,
                            'percent': percent,
                            'amount': tva_amount
                        })
                        seen.add(entry_key)
            except (ValueError, InvalidOperation):
                continue

        # Fallback to simple format if no table entries found
        if not entries:
            simple_pattern = self.TVA_PATTERNS[1]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        # Default to code 'A' for simple format
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break  # Only take first match for simple format
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return SOCAR-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": True,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
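The OMV and SOCAR profiles both declare `DATE_PATTERNS_OCR_SPACES` as `(regex, confidence, field order)` tuples, but the code that consumes them lives in the base class and is not shown here. The sketch below is one plausible reading: try the patterns in the listed order and return the first match that forms a valid calendar date. The `parse_receipt_date` function and the sample strings are hypothetical.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Same tuples as in the OMV and SOCAR profiles: (regex, confidence, field order)
DATE_PATTERNS_OCR_SPACES = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def parse_receipt_date(text: str) -> Optional[Tuple[date, float]]:
    """Hypothetical consumer: first pattern that yields a real date wins."""
    for pattern, confidence, order in DATE_PATTERNS_OCR_SPACES:
        for match in re.finditer(pattern, text):
            first, second, third = (int(group) for group in match.groups())
            if order == 'ymd':
                year, month, day = first, second, third
            else:
                year, month, day = third, second, first
            try:
                return date(year, month, day), confidence
            except ValueError:
                continue  # e.g. month 13: not a calendar date, keep looking
    return None


print(parse_receipt_date("OMV PETROM 2025. 08. 14 17:32 BON FISCAL"))
print(parse_receipt_date("Data: 14.08.2025"))
```

Listing the year-first patterns ahead of the day-first ones matters because OCR often inserts spaces after the dots; a pattern with a trailing time component is ranked highest since it is the least likely to match an unrelated number.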
@@ -0,0 +1,112 @@
"""
STEPOUT MARKET SRL store profile for OCR extraction.

Bookstore with reduced TVA rate (5% for books in Romania).
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class StepoutMarketProfile(BaseStoreProfile):
    """
    STEPOUT MARKET SRL - reduced TVA rate profile (books).

    Key characteristics:
    - Reduced TVA rate: 5% for books (cărți qualification in Romania)
    - May also have standard rates for non-book items
    - Patterns are flexible to accept ANY TVA rate
    - CARD payment typical
    """

    CUI_LIST = ["35532655"]
    NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"]
    STORE_NAME = "STEPOUT MARKET SRL"

    # TVA patterns (flexible - accepts any rate including 5%)
    TVA_PATTERNS = [
        # "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format)
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - 5,00% = YY,YY" (table format)
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA 5% YY,YY" (simple format - common for single rate)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
        # "TVA 5,00%: YY,YY" (percent with colon)
        r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Stepout Market primarily sells books which have 5% TVA in Romania.
        The patterns are generic and will extract whatever rate is on the receipt.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first (have code letter)
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format (no code letter, just percent + amount)
        if not entries:
            for pattern in self.TVA_PATTERNS[2:]:
                for match in re.finditer(pattern, text, re.IGNORECASE):
                    try:
                        percent = int(match.group(1))
                        amount = self._parse_decimal(match.group(2))

                        if amount and amount > 0:
                            # Default to code 'A' for simple format
                            entries.append({
                                'code': 'A',
                                'percent': percent,
                                'amount': amount
                            })
                            break  # Only take first match for simple format
                    except (ValueError, InvalidOperation):
                        continue
                if entries:
                    break

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return STEPOUT MARKET-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": True,  # May have client CUI
            "has_efactura": False,
            "is_non_vat_payer": False,
            "typical_tva_rate": 5,  # Books have 5% TVA in Romania
            "product_category": "books",
        }
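To see how the two-stage extraction in the Stepout Market profile plays out, the sketch below runs the same four patterns over a few invented receipt lines: coded patterns first, then the code-less fallbacks only when nothing was found. The `to_decimal` helper is an assumed, simplified replacement for `_parse_decimal` (comma-decimal amounts only), and `extract` is a standalone rewrite for illustration rather than the profile method itself.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List, Optional

# Same expressions as StepoutMarketProfile.TVA_PATTERNS
TVA_PATTERNS = [
    r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
    r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
    r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
]


def to_decimal(raw: str) -> Optional[Decimal]:
    """Assumed helper: treats ',' as the decimal mark and '.' as thousands separator."""
    try:
        return Decimal(raw.replace('.', '').replace(',', '.'))
    except InvalidOperation:
        return None


def extract(text: str) -> List[dict]:
    entries, seen = [], set()
    # Stage 1: patterns that capture a rate code letter (A-D)
    for pattern in TVA_PATTERNS[:2]:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            code, percent = match.group(1).upper(), int(match.group(2))
            amount = to_decimal(match.group(3))
            if amount and amount > 0 and (code, percent) not in seen:
                entries.append({'code': code, 'percent': percent, 'amount': amount})
                seen.add((code, percent))
    # Stage 2: code-less patterns, first usable match only, default code 'A'
    if not entries:
        for pattern in TVA_PATTERNS[2:]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                amount = to_decimal(match.group(2))
                if amount and amount > 0:
                    entries.append({'code': 'A', 'percent': int(match.group(1)),
                                    'amount': amount})
                    return entries
    return entries


print(extract("TVA A: 5% = 2,38"))                  # coded format, stage 1
print(extract("TVA 5,00%: 2,38"))                   # falls through to stage 2
print(extract("TVA A 19% 10,00\nTVA B 9% 3,00"))    # two rates, two entries
```

Deduplicating on `(code, percent)` keeps the first amount seen for a rate, so a line matched by both coded patterns is counted once.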
@@ -0,0 +1,103 @@
"""
UNLIMITED KEYS S.R.L. store profile for OCR extraction.

Key duplication service. Notable for CASH (NUMERAR) payments.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class UnlimitedKeysProfile(BaseStoreProfile):
    """
    UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Key duplication service
    - NUMERAR (cash) payment common - different from most stores!
    - May also accept CARD
    """

    CUI_LIST = ["18993187"]
    NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"]
    STORE_NAME = "UNLIMITED KEYS S.R.L."

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return UNLIMITED KEYS-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,  # May be NUMERAR (cash)
            "has_client_cui": True,  # May have client CUI
            "has_efactura": False,
            "is_non_vat_payer": False,
            "common_payment": "NUMERAR",  # Cash payments common
        }