feat(ocr): Add modular store profiles with hot-reload support

## Store Profiles System
- Add ProfileRegistry for CUI-based profile lookup
- Add BaseStoreProfile with generic extraction patterns
- Implement hot-reload via POST /api/data-entry/ocr/profiles/reload

## 12 Store Profiles
- LIDL: Multi-rate TVA (A, B, C, D codes)
- OMV, SOCAR: B2B with client CUI, YYYY.MM.DD dates
- BRICK, DEDEMAN: Standard TVA, e-factura support
- KINETERRA, BEST PRINT: Non-VAT payers (returns [])
- STEPOUT MARKET: TVA 5% (books/reduced rate)
- UNLIMITED KEYS: NUMERAR payment detection
- GAMA INK, ELECTROBERING, PICTUS VELUM: Standard TVA

## Flexible TVA Patterns
- All patterns use (\d{1,2})% to accept any rate
- Supports historical (19%, 9%, 5%) and current (21%, 11%)

## Payment Methods Fix
- Fixed base.py to support multiple payments of same type
- Changed deduplication from method-only to (method, amount) tuple
- Returns separate entries for split payments

## Tools
- Add generate_store_profile.py for automatic profile generation
- Analyzes PDFs via OCR API and detects patterns

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
Claude Agent
2026-01-06 23:07:07 +00:00
parent 67b0082df0
commit 099556213d
25 changed files with 3707 additions and 114 deletions

View File

@@ -0,0 +1,106 @@
# Claude Learn: Backend
**Domain**: backend
**Last updated**: 2026-01-06
**Sessions recorded**: 1
Knowledge about FastAPI, Python services, Oracle DB, and backend architecture.
---
## Patterns
### ProfileRegistry cu Hot-Reload pentru Store Profiles
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
**Description**: Sistem de înregistrare profile OCR folosind decorator `@ProfileRegistry.register` cu hot-reload via `importlib.reload()`. Permite adăugarea/modificarea profilelor fără restart server.
**Example** (`backend/modules/data_entry/services/ocr/profiles/__init__.py`):
```python
class ProfileRegistry:
_profiles: Dict[str, Type["BaseStoreProfile"]] = {}
_instances: Dict[str, "BaseStoreProfile"] = {}
@classmethod
def register(cls, profile_class):
"""Decorator to register a store profile class."""
for cui in profile_class.CUI_LIST:
cls._profiles[cls._normalize_cui(cui)] = profile_class
return profile_class
@classmethod
def reload_all(cls):
"""Hot-reload all profile modules via importlib.reload()."""
cls._instances.clear()
for module_name in cls._get_profile_module_names():
importlib.reload(sys.modules[f"backend...profiles.{module_name}"])
```
**Usage**:
```python
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
CUI_LIST = ["22891860"]
STORE_NAME = "LIDL DISCOUNT S.R.L."
# Lookup
profile = ProfileRegistry.get_profile("22891860")
# Hot-reload (endpoint)
POST /api/data-entry/ocr/profiles/reload
```
**Tags**: registry-pattern, hot-reload, decorator, ocr, singleton
---
### Script generare cod Python din analiză PDF
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
**Description**: Script care analizează PDF-uri via OCR API, detectează pattern-uri (TVA format, date format, payment) și generează automat cod Python pentru profile noi. Include JWT auth, async polling, și verificare sintaxă.
**Example** (`scripts/generate_store_profile.py`):
```python
def analyze_tva_patterns(results: List[Dict]) -> Dict:
"""Detectează format TVA dominant din rezultatele OCR."""
tva_formats = defaultdict(int)
for text in raw_texts:
if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
tva_formats["lidl_multi_rate"] += 1
if re.search(r'BAZA\s+TVA', text_upper):
tva_formats["table"] += 1
return {"dominant_format": max(tva_formats, key=tva_formats.get)}
def generate_profile_code(store_name, cui, tva_analysis, ...):
"""Generează cod Python pentru clasa de profil."""
# Template-based generation cu OCR error variants
```
**Usage**:
```bash
# Dry-run pentru preview
python scripts/generate_store_profile.py \
--name "Magazin Nou" --cui "12345678" \
--receipts "docs/data-entry/MagazinNou*.pdf" --dry-run
# Generează și salvează
python scripts/generate_store_profile.py \
--name "Magazin Nou" --cui "12345678" \
--receipts "docs/data-entry/MagazinNou*.pdf" \
--output backend/.../profiles/magazin_nou.py
```
**Tags**: code-generation, ocr, automation, cli-tool
---
## Gotchas
_(None recorded yet)_
---
## Statistics
- **Total Patterns**: 2
- **Total Gotchas**: 0
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1

View File

@@ -0,0 +1,28 @@
# Claude Learn: Database
**Domain**: database
**Last updated**: -
**Sessions recorded**: 0
Knowledge about Oracle DB, SQLite, SQLModel, migrations, and data modeling.
---
## Patterns
_(None recorded yet)_
---
## Gotchas
_(None recorded yet)_
---
## Statistics
- **Total Patterns**: 0
- **Total Gotchas**: 0
- **Last Session**: -
- **Sessions Recorded**: 0

View File

@@ -0,0 +1,55 @@
# Claude Learn: Deployment
**Domain**: deployment
**Last updated**: 2026-01-06
**Sessions recorded**: 1
Knowledge about IIS, Docker, deployment scripts, and infrastructure.
---
## Patterns
### IIS URL Rewrite Rules for SPA with Multiple API Backends
**Discovered**: 2025-12-22 (feature: unified-app)
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
**Example** (`public/web.config:5-28`):
```xml
<rewrite>
<rules>
<rule name="Proxy Reports API" stopProcessing="true">
<match url="^api/reports/(.*)" />
<action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
</rule>
<rule name="Proxy Data Entry API" stopProcessing="true">
<match url="^api/data-entry/(.*)" />
<action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
</rule>
<rule name="SPA Fallback" stopProcessing="true">
<match url=".*" />
<conditions logicalGrouping="MatchAll">
<add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
</conditions>
<action type="Rewrite" url="/index.html" />
</rule>
</rules>
</rewrite>
```
**Tags**: iis, deployment, spa, microservices, proxy
---
## Gotchas
_(None recorded yet)_
---
## Statistics
- **Total Patterns**: 1
- **Total Gotchas**: 0
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1

View File

@@ -0,0 +1,83 @@
# Claude Learn Domains Configuration
**Last updated**: 2026-01-06
This file defines available knowledge domains and their file path patterns.
---
## Domains
### frontend
**File**: `claude-learn-frontend.md`
**Patterns**:
- `src/**/*.vue`
- `src/**/*.js`
- `src/**/*.ts`
- `src/**/*.css`
- `vite.config.*`
- `package.json`
---
### backend
**File**: `claude-learn-backend.md`
**Patterns**:
- `backend/**/*.py`
- `backend/modules/**/*`
- `requirements.txt`
---
### database
**File**: `claude-learn-database.md`
**Patterns**:
- `**/*.sql`
- `**/models.py`
- `**/schemas.py`
- `backend/**/db/**/*`
---
### testing
**File**: `claude-learn-testing.md`
**Patterns**:
- `tests/**/*`
- `**/*.test.*`
- `**/*.spec.*`
- `pytest.ini`
- `vitest.config.*`
---
### deployment
**File**: `claude-learn-deployment.md`
**Patterns**:
- `deployment/**/*`
- `public/web.config`
- `Dockerfile*`
- `docker-compose*.yml`
- `*.sh`
- `ansible/**/*`
---
### global
**File**: `claude-learn-global.md`
**Patterns**:
- `*` (catch-all for cross-cutting concerns)
---
## Statistics
| Domain | Patterns | Gotchas | Last Updated |
|--------|----------|---------|--------------|
| frontend | 8 | 10 | 2026-01-06 |
| deployment | 1 | 0 | 2026-01-06 |
| global | 0 | 1 | 2026-01-06 |
| backend | 2 | 0 | 2026-01-06 |
| database | 0 | 0 | - |
| testing | 0 | 0 | - |
**Total**: 11 patterns, 11 gotchas across 4 domains

View File

@@ -1,9 +1,10 @@
# Learned Patterns & Gotchas
# Claude Learn: Frontend
**Last updated**: 2025-12-24
**Maintained**: Manually (add new patterns/gotchas as discovered)
**Domain**: frontend
**Last updated**: 2026-01-06
**Sessions recorded**: 3
This file contains insights learned during feature implementations. Claude Code auto-loads this file to prevent repeating past mistakes.
Knowledge about Vue.js, Vite, Pinia, CSS, and frontend architecture.
---
@@ -130,37 +131,6 @@ resolve: {
---
### IIS URL Rewrite Rules for SPA with Multiple API Backends
**Discovered**: 2025-12-22 (feature: unified-app)
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
**Example** (`public/web.config:5-28`):
```xml
<rewrite>
<rules>
<rule name="Proxy Reports API" stopProcessing="true">
<match url="^api/reports/(.*)" />
<action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
</rule>
<rule name="Proxy Data Entry API" stopProcessing="true">
<match url="^api/data-entry/(.*)" />
<action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
</rule>
<rule name="SPA Fallback" stopProcessing="true">
<match url=".*" />
<conditions logicalGrouping="MatchAll">
<add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
</conditions>
<action type="Rewrite" url="/index.html" />
</rule>
</rules>
</rewrite>
```
**Tags**: iis, deployment, spa, microservices, proxy
---
### Vue Watcher for Auto-Loading Dependent Data
**Discovered**: 2025-12-24 (feature: unified-app-ux)
**Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention.
@@ -248,15 +218,6 @@ const getStorageKey = () => {
---
### Sed Command Quote Mismatch in Bulk Find-Replace
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
**Tags**: sed, regex, scripting, find-replace, migration
---
### Circular Reference in API Wrapper
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name.
@@ -287,7 +248,7 @@ const getStorageKey = () => {
### Vite Build Transform Count is Progress Indicator
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration.
**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 20057314901492 modules meant we were getting close to success. Use this as encouragement!
**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200->573->1490->1492 modules meant we were getting close to success. Use this as encouragement!
**Tags**: vite, build, debugging, progress-tracking, developer-experience
@@ -329,9 +290,9 @@ const getStorageKey = () => {
---
## Memory Statistics
## Statistics
- **Total Patterns**: 9
- **Total Gotchas**: 11
- **Last Session**: 2025-12-24 (unified-app-ux)
- **Sessions Recorded**: 2
- **Total Patterns**: 8
- **Total Gotchas**: 10
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 3

View File

@@ -0,0 +1,33 @@
# Claude Learn: Global
**Domain**: global
**Last updated**: 2026-01-06
**Sessions recorded**: 1
Cross-cutting knowledge applicable to multiple domains (scripting, tooling, workflow).
---
## Patterns
_(None recorded yet)_
---
## Gotchas
### Sed Command Quote Mismatch in Bulk Find-Replace
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
**Tags**: sed, regex, scripting, find-replace, migration
---
## Statistics
- **Total Patterns**: 0
- **Total Gotchas**: 1
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1

View File

@@ -0,0 +1,28 @@
# Claude Learn: Testing
**Domain**: testing
**Last updated**: -
**Sessions recorded**: 0
Knowledge about pytest, Vitest, test patterns, and validation strategies.
---
## Patterns
_(None recorded yet)_
---
## Gotchas
_(None recorded yet)_
---
## Statistics
- **Total Patterns**: 0
- **Total Gotchas**: 0
- **Last Session**: -
- **Sessions Recorded**: 0

View File

@@ -628,3 +628,86 @@ def _dict_to_extraction_data(data: dict) -> ExtractionData:
validation_errors=data.get('validation_errors', []),
inter_ocr_ratios=data.get('inter_ocr_ratios', {}),
)
# ============================================================================
# Store Profiles Management Endpoints
# ============================================================================
@router.post("/profiles/reload")
async def reload_store_profiles(
current_user: CurrentUser = Depends(get_current_user)
) -> dict:
"""
Hot-reload all store profiles.
Reloads profile Python modules without server restart.
Use after adding/modifying profile files.
Returns:
Dict with reloaded count and profile list
"""
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
count = ProfileRegistry.reload_all()
status = ProfileRegistry.get_reload_status()
return {
"success": True,
"reloaded_modules": count,
"profiles_count": status["profiles_count"],
"registered_cuis": status["registered_cuis"],
"last_reload": status["last_reload"],
}
@router.get("/profiles")
async def list_store_profiles(
current_user: CurrentUser = Depends(get_current_user)
) -> dict:
"""
List all registered store profiles.
Returns:
Dict with profiles list and status
"""
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
profiles = ProfileRegistry.list_profiles()
status = ProfileRegistry.get_reload_status()
return {
"profiles": profiles,
"count": len(profiles),
"last_reload": status["last_reload"],
}
@router.get("/profiles/{cui}")
async def get_store_profile(
cui: str,
current_user: CurrentUser = Depends(get_current_user)
) -> dict:
"""
Get details for a specific store profile.
Args:
cui: Store CUI (with or without RO prefix)
Returns:
Profile details including validation hints
Raises:
404: If no profile exists for this CUI
"""
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
info = ProfileRegistry.get_profile_info(cui)
if not info:
raise HTTPException(
status_code=404,
detail=f"No profile registered for CUI: {cui}"
)
return info

View File

@@ -0,0 +1,258 @@
# Store Profiles - OCR Extraction
Sistem de profile specifice pentru extracție OCR cu hot-reload.
---
## Quick Start: Adaugă un profil nou
```bash
# 1. Generează profil din PDF-uri (dry-run pentru preview)
python scripts/generate_store_profile.py \
--name "Magazin Nou SRL" \
--cui "12345678" \
--receipts "docs/data-entry/MagazinNou*.pdf" \
--dry-run
# 2. Generează și salvează
python scripts/generate_store_profile.py \
--name "Magazin Nou SRL" \
--cui "12345678" \
--receipts "docs/data-entry/MagazinNou*.pdf" \
--output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py
# 3. Hot-reload (fără restart server)
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload
# 4. Verifică
curl http://localhost:8000/api/data-entry/ocr/profiles
```
---
## Structura directorului
```
profiles/
├── __init__.py # ProfileRegistry + hot-reload (~390 linii)
├── base.py # BaseStoreProfile + pattern-uri generice (~410 linii)
├── lidl.py # Multi-rate TVA (A/B)
├── omv.py # B2B, date YYYY.MM.DD
├── socar.py # B2B, date YYYY.MM.DD
├── brick.py # Standard TVA
├── dedeman.py # E-factura support
├── kineterra.py # Non-VAT payer
├── gama_ink.py # Standard TVA (toner/cartușe)
├── electrobering.py # Standard TVA (electronice)
├── pictus_velum.py # Standard TVA (rechizite)
├── unlimited_keys.py # Standard TVA, NUMERAR payment
├── best_print.py # Non-VAT payer (neplătitor TVA)
├── stepout_market.py # TVA 5% (cărți/librărie)
└── README.md # Acest fișier
```
---
## Profile existente (12 profile)
> **Note**: Pattern-urile TVA sunt **flexibile** și acceptă ORICE cotă (5%, 9%, 11%, 19%, 21%, etc.)
> pentru a gestiona atât datele istorice cât și schimbările viitoare ale legislației.
| Magazin | CUI | Fișier | Caracteristici |
|---------|-----|--------|----------------|
| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (coduri A, B, C, D) |
| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), date YYYY.MM.DD |
| SOCAR PETROLEUM S.A. | 12546600 | `socar.py` | B2B (client CUI), date YYYY.MM.DD |
| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA |
| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support |
| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returnează `[]`) |
| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartușe) |
| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronice) |
| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (rechizite) |
| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** plată |
| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** (neplătitor TVA) |
| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (cărți, librărie) |
---
## API Endpoints
| Endpoint | Metodă | Descriere |
|----------|--------|-----------|
| `/api/data-entry/ocr/profiles` | GET | Lista toate profilele |
| `/api/data-entry/ocr/profiles/{cui}` | GET | Detalii profil (acceptă RO prefix) |
| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload toate profilele |
### Exemple API
```bash
# Lista profile
curl http://localhost:8000/api/data-entry/ocr/profiles \
-H "Authorization: Bearer <token>"
# Detalii profil (cu sau fără RO prefix)
curl http://localhost:8000/api/data-entry/ocr/profiles/22891860
curl http://localhost:8000/api/data-entry/ocr/profiles/RO22891860
# Hot-reload după modificări
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
-H "Authorization: Bearer <token>"
# Response reload:
{
"success": true,
"reloaded_modules": 12,
"profiles_count": 12,
"registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...],
"last_reload": "2026-01-06T22:37:05.000000"
}
```
---
## Cum funcționează sistemul
### Flow de extracție
```
ReceiptExtractor.extract()
├─► STEP 1: Extrage vendor + CUI
│ └─► _extract_vendor(), _extract_cui()
├─► ProfileRegistry.get_profile(cui)
│ └─► Returnează profil specific sau None
├─► STEP 2: Extracție cu profil (dacă există)
│ ├─► profile.extract_total()
│ ├─► profile.extract_date()
│ ├─► profile.extract_receipt_number()
│ ├─► profile.extract_tva_entries()
│ ├─► profile.extract_payment_methods()
│ └─► profile.extract_client_cui()
└─► STEP 3-4: Validare + post-procesare
```
### Fallback
Dacă nu există profil pentru CUI, se folosește logica generică din `ReceiptExtractor`.
---
## Structura unui profil
```python
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class MagazinNouProfile(BaseStoreProfile):
"""Docstring cu descriere magazin."""
CUI_LIST = ["12345678"] # Poate avea mai multe CUI-uri
NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"] # OCR variants
STORE_NAME = "Magazin Nou SRL"
# Override doar ce e diferit de base class
def extract_tva_entries(self, text: str) -> List[dict]:
# Pattern-uri specifice magazinului
...
def get_validation_hints(self) -> Dict[str, Any]:
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": False,
}
```
---
## Pattern-uri disponibile în base.py
BaseStoreProfile include pattern-uri generice OCR-tolerant:
| Pattern | Descriere |
|---------|-----------|
| `TOTAL_PATTERNS` | 8 variante pentru TOTAL (TOTAL:, TOTAL DE PLATA, etc.) |
| `DATE_PATTERNS` | 6 variante (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) |
| `DATE_PATTERNS_OCR_SPACES` | 4 variante cu spații OCR ("2025. 08. 14") |
| `NUMBER_PATTERNS` | 11 variante pentru număr bon (NDS, BF, C3POS) |
| `PAYMENT_PATTERNS` | 8 variante pentru CARD/NUMERAR |
| `CLIENT_MARKERS` | 6 variante pentru secțiune CLIENT |
| `CLIENT_CUI_PATTERNS` | 7 variante pentru CUI client |
### Metode implementate în base class
- `extract_total(text)``Tuple[Decimal, float]`
- `extract_date(text)``Tuple[date, float]`
- `extract_receipt_number(text)``Tuple[str, float]`
- `extract_payment_methods(text)``List[dict]`
- `extract_client_cui(text)``Tuple[str, float]`
- `extract_client_name(text)``Tuple[str, float]`
---
## Când ai nevoie de profil custom?
| Situație | Exemplu | Ce trebuie override |
|----------|---------|---------------------|
| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` |
| **Format dată special** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` |
| **B2B receipts** | Benzinării (au client CUI) | `extract_client_cui()` |
| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returnează `[]` |
| **E-factura** | Dedeman | `extract_efactura_reference()` |
---
## Decizii de design
1. **Hot-reload manual** - endpoint `/profiles/reload` apelat când se modifică fișiere
2. **Persistență în Python** - profile în Git, version controlled
3. **Fallback graceful** - dacă nu există profil, folosește logica generică
4. **CUI normalization** - gestionează automat prefixul "RO" și whitespace
5. **Deduplicare TVA** - folosește `seen = set()` pentru a evita duplicate
---
## Comenzi utile
```bash
# Verifică syntax Python pentru toate profilele
for f in backend/modules/data_entry/services/ocr/profiles/*.py; do
python3 -m py_compile "$f" && echo "✓ $(basename $f)"
done
# Lista profile
ls -la backend/modules/data_entry/services/ocr/profiles/
# Pornește backend pentru testare
cd backend && source venv/bin/activate
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
# Test OCR pe un PDF
curl -X POST -F "file=@docs/data-entry/test.pdf" \
-H "Authorization: Bearer <token>" \
"http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus"
```
---
## Script generare profile
`scripts/generate_store_profile.py` - generator automat de profile
```bash
# Vezi help
python scripts/generate_store_profile.py --help
# Funcționalități:
# - Analizează PDF-uri via OCR API
# - Detectează: TVA format, date format, payment patterns, B2B
# - Generează cod Python cu OCR error variants
# - Suportă glob patterns (*.pdf)
# - Verifică sintaxa după generare
```

View File

@@ -0,0 +1,388 @@
"""
Store Profiles Registry with Hot-Reload Support.
This module provides a registry for store-specific OCR extraction profiles.
Profiles can be reloaded at runtime without restarting the server.
Usage:
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
# Get profile for a CUI
profile = ProfileRegistry.get_profile("22891860")
if profile:
tva_entries = profile.extract_tva_entries(text)
# Reload all profiles (after file changes)
count = ProfileRegistry.reload_all()
Architecture:
- ProfileRegistry: Singleton registry with class methods
- BaseStoreProfile: Abstract base class for profiles
- @ProfileRegistry.register: Decorator for profile classes
Hot-Reload Mechanism:
1. Admin calls POST /profiles/reload endpoint
2. Registry clears instance cache
3. importlib.reload() re-executes each profile module
4. @register decorator re-registers classes with new code
"""
from __future__ import annotations
import importlib
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Type, TYPE_CHECKING
if TYPE_CHECKING:
from .base import BaseStoreProfile
logger = logging.getLogger(__name__)
# Directory containing profile modules
PROFILES_DIR = Path(__file__).parent
class ProfileRegistry:
"""
Registry for store-specific OCR extraction profiles.
Uses class methods for singleton-like behavior without explicit instantiation.
Supports hot-reload via importlib.reload() for runtime updates.
Attributes:
_profiles: Maps CUI -> profile class (not instance)
_instances: Maps CUI -> profile instance (lazy, cleared on reload)
_last_reload: Timestamp of last reload
_loaded: Whether initial load has been performed
"""
# Class-level storage (singleton pattern via class methods)
_profiles: Dict[str, Type["BaseStoreProfile"]] = {}
_instances: Dict[str, "BaseStoreProfile"] = {}
_last_reload: Optional[datetime] = None
_loaded: bool = False
# -------------------------------------------------------------------------
# Registration
# -------------------------------------------------------------------------
@classmethod
def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]:
"""
Decorator to register a store profile class.
Registers the profile for all CUIs in the class's CUI_LIST.
Safe for re-registration during hot-reload (overwrites existing).
Usage:
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
CUI_LIST = ["22891860"]
...
Args:
profile_class: Profile class to register
Returns:
The same class (allows use as decorator)
Raises:
ValueError: If CUI_LIST is empty
"""
cui_list = getattr(profile_class, 'CUI_LIST', [])
store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__)
if not cui_list:
logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping")
return profile_class
# Register for each CUI
for cui in cui_list:
# Normalize CUI (remove RO prefix, strip whitespace)
normalized_cui = cls._normalize_cui(cui)
if normalized_cui in cls._profiles:
old_class = cls._profiles[normalized_cui]
logger.debug(
f"Re-registering CUI {normalized_cui}: "
f"{old_class.__name__} -> {profile_class.__name__}"
)
# Clear cached instance for this CUI
cls._instances.pop(normalized_cui, None)
cls._profiles[normalized_cui] = profile_class
logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}")
logger.info(f"Registered {store_name} for CUIs: {cui_list}")
return profile_class
# -------------------------------------------------------------------------
# Lookup
# -------------------------------------------------------------------------
@classmethod
def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]:
"""
Get profile instance for a CUI.
Uses lazy instantiation - creates instance on first access.
Returns None if no profile is registered for this CUI.
Args:
cui: CUI to lookup (with or without RO prefix)
Returns:
Profile instance or None
"""
if not cui:
return None
# Ensure profiles are loaded
if not cls._loaded:
cls._load_all_profiles()
normalized_cui = cls._normalize_cui(cui)
# Check if profile exists
profile_class = cls._profiles.get(normalized_cui)
if not profile_class:
return None
# Lazy instantiation
if normalized_cui not in cls._instances:
try:
cls._instances[normalized_cui] = profile_class()
logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}")
except Exception as e:
logger.error(f"Failed to instantiate {profile_class.__name__}: {e}")
return None
return cls._instances[normalized_cui]
@classmethod
def has_profile(cls, cui: Optional[str]) -> bool:
"""Check if a profile exists for this CUI."""
if not cui:
return False
if not cls._loaded:
cls._load_all_profiles()
return cls._normalize_cui(cui) in cls._profiles
# -------------------------------------------------------------------------
# Listing
# -------------------------------------------------------------------------
@classmethod
def list_profiles(cls) -> List[Dict]:
"""
List all registered profiles.
Returns:
List of dicts with cui, class_name, store_name, name_patterns
"""
if not cls._loaded:
cls._load_all_profiles()
result = []
seen_classes = set()
for cui, profile_class in cls._profiles.items():
# Avoid duplicates for profiles with multiple CUIs
if profile_class.__name__ in seen_classes:
continue
seen_classes.add(profile_class.__name__)
result.append({
"cuis": list(getattr(profile_class, 'CUI_LIST', [])),
"class_name": profile_class.__name__,
"store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__),
"name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])),
})
return result
@classmethod
def get_profile_info(cls, cui: str) -> Optional[Dict]:
"""
Get detailed info about a profile.
Args:
cui: CUI to lookup
Returns:
Dict with profile details or None
"""
profile = cls.get_profile(cui)
if not profile:
return None
return {
"cui": cui,
"cuis": list(profile.CUI_LIST),
"class_name": profile.__class__.__name__,
"store_name": profile.STORE_NAME,
"name_patterns": list(profile.NAME_PATTERNS),
"validation_hints": profile.get_validation_hints(),
}
# -------------------------------------------------------------------------
# Hot-Reload
# -------------------------------------------------------------------------
@classmethod
def reload_all(cls) -> int:
"""
Hot-reload all profile modules.
Clears instance cache and reloads all .py files in profiles directory.
Decorator re-registers classes with updated code.
Returns:
Number of modules reloaded
"""
logger.info("Starting profile hot-reload...")
# Clear instance cache (will be recreated on next get_profile)
cls._instances.clear()
# Get list of profile modules (exclude __init__, base)
module_names = cls._get_profile_module_names()
count = 0
for module_name in module_names:
full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
try:
if full_name in sys.modules:
# Reload existing module
importlib.reload(sys.modules[full_name])
logger.debug(f"Reloaded module: {module_name}")
else:
# Import new module
importlib.import_module(full_name)
logger.debug(f"Imported new module: {module_name}")
count += 1
except Exception as e:
logger.error(f"Failed to reload {module_name}: {e}")
cls._last_reload = datetime.utcnow()
cls._loaded = True
logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles")
return count
@classmethod
def get_reload_status(cls) -> Dict:
"""Get status of the registry including last reload time."""
return {
"loaded": cls._loaded,
"last_reload": cls._last_reload.isoformat() if cls._last_reload else None,
"profiles_count": len(cls._profiles),
"instances_count": len(cls._instances),
"registered_cuis": list(cls._profiles.keys()),
}
# -------------------------------------------------------------------------
# Internal methods
# -------------------------------------------------------------------------
@classmethod
def _normalize_cui(cls, cui: str) -> str:
"""
Normalize CUI for consistent lookup.
- Removes RO prefix (with or without space)
- Strips whitespace
- Converts to uppercase
Args:
cui: Raw CUI string
Returns:
Normalized CUI (digits only)
"""
if not cui:
return ""
cui = str(cui).strip().upper()
# Remove RO prefix (handles "RO12345" and "RO 12345")
if cui.startswith("RO"):
cui = cui[2:].lstrip()
return cui.strip()
@classmethod
def _get_profile_module_names(cls) -> List[str]:
"""
Get list of profile module names from profiles directory.
Excludes __init__.py and base.py.
Returns:
List of module names (without .py extension)
"""
excluded = {"__init__", "base", "__pycache__"}
modules = []
for path in PROFILES_DIR.glob("*.py"):
name = path.stem
if name not in excluded:
modules.append(name)
return sorted(modules)
@classmethod
def _load_all_profiles(cls) -> None:
"""
Initial load of all profile modules.
Called automatically on first get_profile() if not already loaded.
"""
if cls._loaded:
return
logger.info("Loading store profiles...")
module_names = cls._get_profile_module_names()
for module_name in module_names:
full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
try:
importlib.import_module(full_name)
logger.debug(f"Loaded module: {module_name}")
except Exception as e:
logger.error(f"Failed to load {module_name}: {e}")
cls._loaded = True
cls._last_reload = datetime.utcnow()
logger.info(f"Loaded {len(cls._profiles)} store profiles")
@classmethod
def clear(cls) -> None:
"""
Clear all registered profiles.
Mainly useful for testing.
"""
cls._profiles.clear()
cls._instances.clear()
cls._loaded = False
cls._last_reload = None
# -------------------------------------------------------------------------
# Module exports
# -------------------------------------------------------------------------
__all__ = [
"ProfileRegistry",
"BaseStoreProfile",
]
# Re-export BaseStoreProfile for convenience
from .base import BaseStoreProfile

View File

@@ -0,0 +1,515 @@
"""
Base class for store-specific OCR extraction profiles.
Each store can have different receipt formats (TVA layout, total position, etc.).
Store profiles allow customizing extraction logic per-store for better accuracy.
Usage:
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
CUI_LIST = ["22891860"]
NAME_PATTERNS = ["LIDL", "LDL"]
def extract_tva_entries(self, text: str) -> List[dict]:
# Custom Lidl TVA extraction logic
...
"""
import re
from abc import ABC
from decimal import Decimal, InvalidOperation
from typing import List, Optional, Tuple, Dict, Any
from datetime import date
class BaseStoreProfile(ABC):
"""
Abstract base class for store-specific extraction profiles.
Each profile defines:
- CUI_LIST: CUI codes that identify this store (without RO prefix)
- NAME_PATTERNS: OCR-tolerant name patterns for fallback matching
- Custom extraction methods for TVA, total, date, etc.
The ProfileRegistry uses CUI_LIST to lookup profiles during extraction.
"""
# -------------------------------------------------------------------------
# Class attributes - override in subclasses
# -------------------------------------------------------------------------
# List of CUI codes (without RO prefix) that identify this store
CUI_LIST: List[str] = []
# OCR-tolerant name patterns for fallback matching
NAME_PATTERNS: List[str] = []
# Store display name
STORE_NAME: str = "Unknown Store"
# -------------------------------------------------------------------------
# Generic patterns - can be overridden in subclasses
# -------------------------------------------------------------------------
# Total amount patterns (confidence-weighted)
TOTAL_PATTERNS = [
(r'T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98),
(r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
(r'[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95),
(r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
(r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
(r'SUBTOTAL\s*([\d\s.,]+)', 0.90),
(r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
(r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
]
# Date patterns (confidence-weighted)
DATE_PATTERNS = [
(r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
(r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
(r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95),
(r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90),
(r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80),
(r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75),
]
# Date patterns with OCR-introduced spaces (separate because format is different)
DATE_PATTERNS_OCR_SPACES = [
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'),
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]
# Receipt number patterns (confidence-weighted)
NUMBER_PATTERNS = [
(r'NDS\s*:?\s*(\d+)', 0.98),
(r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98),
(r'C3POS.*?(\d{6,7})\b', 0.95),
(r'BF\s*:\s*(\d{4,})', 0.96),
(r'BF\s+(\d{4,})', 0.93),
(r'NIVS\s*:?\s*(\d+)', 0.95),
(r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
(r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
(r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95),
(r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90),
(r'ID\s*BF\s*:?\s*(\d+)', 0.90),
]
# Payment method patterns (pattern, method_type, confidence)
PAYMENT_PATTERNS = [
(r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98),
(r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97),
(r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95),
(r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95),
(r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90),
(r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70),
(r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75),
(r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70),
]
# Client section markers (for B2B receipts)
CLIENT_MARKERS = [
r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:',
r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:',
r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:',
r'CLIENT\s*:',
r'CUMPARATOR\s*:',
r'BENEFICIAR\s*:',
]
# Client CUI patterns (pattern, confidence)
CLIENT_CUI_PATTERNS = [
(r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99),
(r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98),
(r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98),
(r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
(r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
(r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95),
(r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90),
]
# Company type indicators (for identifying company names)
COMPANY_INDICATORS = [
r'\bS\.?\s*R\.?\s*L\.?\b', # S.R.L. or S. R. L.
r'\bS\.?\s*A\.?\b', # S.A. or S. A.
r'\bS\.?\s*N\.?\s*C\.?\b', # S.N.C. or S. N. C.
r'\bS\.?\s*C\.?\s*S\.?\b', # S.C.S. or S. C. S.
r'\bI\.?\s*I\.?\b', # I.I. or I. I.
r'\bP\.?\s*F\.?\s*A\.?\b', # P.F.A. or P. F. A.
r'\bS\.?\s*C\.?\s+[A-Z]', # S.C. followed by company name
r'HOLDING',
r'COMPANY',
r'GROUP',
]
# Maximum reasonable payment amount (to filter OCR errors)
MAX_PAYMENT = Decimal('100000')
# -------------------------------------------------------------------------
# Extraction methods - override in subclasses as needed
# -------------------------------------------------------------------------
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Override this method in subclasses to handle store-specific TVA formats.
Args:
text: Raw OCR text from receipt
Returns:
List of dicts with keys: code, percent, amount
"""
return []
def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]:
"""
Extract total amount from receipt text.
Args:
text: Raw OCR text from receipt
Returns:
Tuple of (amount, confidence) or (None, 0.0)
"""
text_upper = text.upper()
for pattern, confidence in self.TOTAL_PATTERNS:
match = re.search(pattern, text_upper)
if match:
amount = self._parse_decimal(match.group(1))
if amount and amount > 0 and amount < self.MAX_PAYMENT:
return (amount, confidence)
return (None, 0.0)
def extract_date(self, text: str) -> Tuple[Optional[date], float]:
"""
Extract receipt date from text.
Args:
text: Raw OCR text from receipt
Returns:
Tuple of (date, confidence) or (None, 0.0)
"""
text_upper = text.upper()
# Try standard patterns first
for pattern, confidence in self.DATE_PATTERNS:
match = re.search(pattern, text_upper)
if match:
parsed = self._parse_date(match.group(1))
if parsed:
return (parsed, confidence)
# Try OCR-corrupted patterns with spaces
for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES:
match = re.search(pattern, text_upper)
if match:
try:
if fmt == 'ymd':
year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
else: # dmy
day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))
if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
return (date(year, month, day), confidence)
except (ValueError, TypeError):
continue
return (None, 0.0)
def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]:
"""
Extract receipt number from text.
Args:
text: Raw OCR text from receipt
Returns:
Tuple of (number, confidence) or (None, 0.0)
"""
text_upper = text.upper()
for pattern, confidence in self.NUMBER_PATTERNS:
match = re.search(pattern, text_upper)
if match:
number = match.group(1).strip()
if number and len(number) >= 3:
return (number, confidence)
return (None, 0.0)
def extract_payment_methods(self, text: str) -> List[dict]:
"""
Extract payment methods (CARD/NUMERAR) from receipt.
Supports multiple payments of the same type (e.g., 2x CARD for split payments).
Each payment is returned as a separate entry with its amount.
Args:
text: Raw OCR text from receipt
Returns:
List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}]
Multiple entries of same method type are allowed for split payments.
"""
text_upper = text.upper()
methods = []
# Track (method, amount) pairs to avoid exact duplicates from overlapping patterns
seen_entries = set()
for pattern, method, confidence in self.PAYMENT_PATTERNS:
for match in re.finditer(pattern, text_upper):
try:
amount = self._parse_decimal(match.group(1))
if amount and amount > 0 and amount < self.MAX_PAYMENT:
# Deduplicate by (method, amount) to avoid same entry from multiple patterns
# But allow different amounts for same method (split payments)
entry_key = (method, amount)
if entry_key not in seen_entries:
methods.append({
'method': method,
'amount': amount,
'confidence': confidence
})
seen_entries.add(entry_key)
except (ValueError, InvalidOperation):
continue
return methods
def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]:
"""
Extract client CUI from B2B receipts.
Args:
text: Raw OCR text from receipt
Returns:
Tuple of (cui, confidence) or (None, 0.0)
"""
text_upper = text.upper()
# First check if there's a CLIENT section
has_client_section = any(
re.search(marker, text_upper, re.IGNORECASE)
for marker in self.CLIENT_MARKERS
)
if not has_client_section:
return (None, 0.0)
# Try to extract CUI
for pattern, confidence in self.CLIENT_CUI_PATTERNS:
match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE)
if match:
cui = match.group(1)
# Normalize: remove RO prefix for storage
cui_digits = re.sub(r'[^0-9]', '', cui)
if 6 <= len(cui_digits) <= 10:
return (cui_digits, confidence)
return (None, 0.0)
def extract_client_name(self, text: str) -> Tuple[Optional[str], float]:
"""
Extract client/buyer company name from B2B receipts.
Args:
text: Raw OCR text from receipt
Returns:
Tuple of (client_name, confidence) or (None, 0.0)
"""
text_upper = text.upper()
lines = text.split('\n')
# First check if there's a CLIENT section
client_section_idx = None
for i, line in enumerate(lines):
line_upper = line.upper().strip()
if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS):
client_section_idx = i
break
if client_section_idx is None:
return (None, 0.0)
# Look for company name in CLIENT section
line = lines[client_section_idx].strip()
line_upper = line.upper()
# Strategy 1: Check if name is on same line after ":"
if ':' in line:
name_part = line.split(':', 1)[1].strip()
if name_part and len(name_part) >= 3:
# Skip if it looks like a CUI (RO followed by digits)
if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()):
pass # This is CUI, not name - continue to next strategy
else:
# Check for company indicators
name_upper = name_part.upper()
if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS):
return (self._clean_company_name(name_part), 0.95)
elif len(name_part) >= 5 and not name_part.isdigit():
return (self._clean_company_name(name_part), 0.80)
# Strategy 2: Check next line for company name
if client_section_idx + 1 < len(lines):
next_line = lines[client_section_idx + 1].strip()
next_upper = next_line.upper()
# Skip if it's a CUI/CIF line or looks like CUI
if not re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', next_upper):
if not re.match(r'^R[O0]?\d{6,10}$', next_upper):
if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS):
return (self._clean_company_name(next_line), 0.90)
elif len(next_line) >= 5 and not next_line.isdigit():
# Check it's not CUI/CIF/COD keywords
if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']):
return (self._clean_company_name(next_line), 0.75)
# Strategy 3: Look for any line with company indicators in CLIENT section region
search_end = min(client_section_idx + 5, len(lines))
for i in range(client_section_idx + 1, search_end):
line = lines[i].strip()
line_upper = line.upper()
# Skip CUI/CIF lines
if re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', line_upper):
continue
if re.match(r'^R[O0]?\d{6,10}$', line_upper):
continue
if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS):
return (self._clean_company_name(line), 0.85)
return (None, 0.0)
@staticmethod
def _clean_company_name(name: str) -> str:
"""Clean company name for storage."""
if not name:
return ""
# Remove extra whitespace
name = re.sub(r'\s+', ' ', name).strip()
# Remove trailing punctuation except periods in S.R.L., S.A., etc.
name = re.sub(r'[,;:]+$', '', name).strip()
return name
# -------------------------------------------------------------------------
# Validation hints - override to customize validation behavior
# -------------------------------------------------------------------------
def get_validation_hints(self) -> Dict[str, Any]:
"""
Return validation hints for this store.
Returns:
Dict with validation hints. Common keys:
- has_multi_rate_tva: bool - Store uses multiple TVA rates
- card_equals_total: bool - CARD payment equals total
- has_client_cui: bool - Receipt includes client CUI
- has_efactura: bool - Store uses e-factura format
- is_non_vat_payer: bool - Store is not a VAT payer
"""
return {}
# -------------------------------------------------------------------------
# Helper methods - available to all subclasses
# -------------------------------------------------------------------------
@staticmethod
def _normalize_number(text: str) -> str:
"""
Normalize a number string for Decimal conversion.
Handles Romanian formats: "1.234,56" -> "1234.56"
"""
if not text:
return "0"
# Remove spaces
text = text.replace(" ", "")
# Determine decimal separator
last_comma = text.rfind(",")
last_dot = text.rfind(".")
if last_comma > last_dot:
text = text.replace(".", "").replace(",", ".")
elif last_dot > last_comma:
text = text.replace(",", "")
else:
text = text.replace(",", ".")
return text
@staticmethod
def _parse_decimal(text: str) -> Optional[Decimal]:
"""Parse a string to Decimal, handling various formats."""
try:
normalized = BaseStoreProfile._normalize_number(text)
return Decimal(normalized)
except (InvalidOperation, ValueError, TypeError):
return None
@staticmethod
def _parse_date(text: str) -> Optional[date]:
"""
Parse date string in various formats.
Supports: DD-MM-YYYY, DD/MM/YYYY, DD.MM.YYYY, YYYY-MM-DD
"""
if not text:
return None
# Normalize separators
text = text.replace('/', '-').replace('.', '-')
try:
parts = text.split('-')
if len(parts) != 3:
return None
# Determine format based on first part length
if len(parts[0]) == 4:
# YYYY-MM-DD
year, month, day = int(parts[0]), int(parts[1]), int(parts[2])
else:
# DD-MM-YYYY
day, month, year = int(parts[0]), int(parts[1]), int(parts[2])
# Validate ranges
if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
return date(year, month, day)
except (ValueError, TypeError, IndexError):
pass
return None
@staticmethod
def _clean_text(text: str) -> str:
"""Clean OCR text for pattern matching."""
if not text:
return ""
text = re.sub(r'\s+', ' ', text)
text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text)
return text.strip()
# -------------------------------------------------------------------------
# Magic methods
# -------------------------------------------------------------------------
def __repr__(self) -> str:
return f"<{self.__class__.__name__} CUI={self.CUI_LIST}>"
def __str__(self) -> str:
return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})"

View File

@@ -0,0 +1,54 @@
"""
BEST PRINT TRADE ACTIV SRL store profile for OCR extraction.
Stamp manufacturing service. Non-VAT payer (neplătitor de TVA).
"""
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class BestPrintProfile(BaseStoreProfile):
"""
BEST PRINT TRADE ACTIV SRL - non-VAT payer profile.
Key characteristics:
- Non-VAT payer (neplătitor de TVA) - NO TVA on receipts
- Stamp manufacturing and printing services
- Total amount has no TVA component
- CARD payment typical
"""
CUI_LIST = ["45417955"]
NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"]
STORE_NAME = "BEST PRINT TRADE ACTIV SRL"
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries - returns empty for non-VAT payer.
BEST PRINT is a non-VAT payer (neplătitor de TVA),
so no TVA entries are expected on receipts.
Args:
text: Raw OCR text from receipt (unused)
Returns:
Empty list (non-VAT payer has no TVA)
"""
# Non-VAT payer - no TVA entries
return []
def get_validation_hints(self) -> Dict[str, Any]:
"""Return BEST PRINT-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": True, # May have client CUI
"has_efactura": False,
"is_non_vat_payer": True, # CRITICAL: Non-VAT payer
"tva_pattern": "none",
}

View File

@@ -0,0 +1,101 @@
"""
BRICK (Five-Holding) store profile for OCR extraction.
Five-Holding S.A. operates BRICK stores with standard receipt format.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class BrickProfile(BaseStoreProfile):
"""
FIVE-HOLDING S.A. (BRICK) - standard TVA format.
Key characteristics:
- Standard TVA format
- Single TVA rate typically
- No client CUI on receipts
"""
CUI_LIST = ["10562600"]
NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"] # OCR variants
STORE_NAME = "FIVE-HOLDING S.A."
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# Simple: "TVA XX% YY,YY"
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract BRICK-specific TVA entries.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return BRICK-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,118 @@
"""
DEDEMAN store profile for OCR extraction.
Dedeman receipts may include e-factura information and use standard TVA format.
Large DIY retailer in Romania.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class DedemanProfile(BaseStoreProfile):
"""
DEDEMAN SRL - standard TVA with e-factura support.
Key characteristics:
- Standard TVA format
- May include e-factura reference number
- Professional receipts for construction materials
"""
CUI_LIST = ["2816464"]
NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"] # OCR variants
STORE_NAME = "DEDEMAN SRL"
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA (XX%) YY,YY"
r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)',
]
# E-factura pattern for reference extraction
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract Dedeman-specific TVA entries.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def extract_efactura_reference(self, text: str) -> str | None:
"""
Extract e-factura reference number if present.
Args:
text: Raw OCR text from receipt
Returns:
E-factura reference string or None
"""
match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE)
return match.group(1) if match else None
def get_validation_hints(self) -> Dict[str, Any]:
"""Return Dedeman-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False,
"has_client_cui": False,
"has_efactura": True,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,102 @@
"""
ELECTROBERING S.R.L. store profile for OCR extraction.
Electronics and home supplies store.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class ElectroberingProfile(BaseStoreProfile):
"""
ELECTROBERING S.R.L. - standard TVA profile.
Key characteristics:
- Standard TVA format (single rate, any percentage)
- Electronics and home supplies
- May have client CUI for B2B purchases
- CARD payment typical
"""
CUI_LIST = ["2744937"]
NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"]
STORE_NAME = "ELECTROBERING S.R.L."
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA XX% YY,YY" (simple format without code)
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return ELECTROBERING-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": True, # May have client CUI for B2B
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,103 @@
"""
GAMA INK SERVICE SRL store profile for OCR extraction.
Toner refill and printer supplies store.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class GamaInkProfile(BaseStoreProfile):
"""
GAMA INK SERVICE SRL - standard TVA profile.
Key characteristics:
- Standard TVA format (single rate, any percentage)
- Service-based (toner refill, printer supplies)
- CARD payment typical
"""
CUI_LIST = ["17741882"]
NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"]
STORE_NAME = "GAMA INK SERVICE SRL"
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA XX% YY,YY" (simple format without code)
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
# "TVA: YY,YY" (amount only, percent inferred)
r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first (have both code and percent)
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format (percent + amount without code)
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return GAMA INK-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,53 @@
"""
KINETERRA store profile for OCR extraction.
Kineterra is a non-VAT payer (neplătitor de TVA).
Receipts don't include TVA breakdown.
"""
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class KineterraProfile(BaseStoreProfile):
"""
KINETERRA CONCEPT SRL - non-VAT payer profile.
Key characteristics:
- Non-VAT payer (neplătitor de TVA)
- No TVA breakdown on receipts
- Total amount has no TVA component
"""
CUI_LIST = ["31180432"]
NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"] # OCR variants
STORE_NAME = "KINETERRA CONCEPT SRL"
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries - returns empty for non-VAT payer.
Kineterra is a non-VAT payer, so no TVA entries are expected.
Args:
text: Raw OCR text from receipt (unused)
Returns:
Empty list (non-VAT payer has no TVA)
"""
# Non-VAT payer - no TVA entries
return []
def get_validation_hints(self) -> Dict[str, Any]:
"""Return Kineterra-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": True,
"tva_pattern": "none",
}

View File

@@ -0,0 +1,93 @@
"""
LIDL store profile for OCR extraction.
Lidl receipts have a specific TVA format without hyphen/colon separators:
TOTAL TVA 9,84
TVA A 21,00% 7,71
TVA B 11,00% 2,13
This profile handles multi-rate TVA extraction for Lidl receipts.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
"""
LIDL DISCOUNT S.R.L. - multi-rate TVA profile.
Key characteristics:
- Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible)
- TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line)
- Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%)
- CARD payment usually equals total
- No client CUI on receipts
"""
CUI_LIST = ["22891860"]
NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"] # OCR variants
STORE_NAME = "LIDL DISCOUNT S.R.L."
# Lidl-specific TVA patterns
# Format: "TVA A 21,00% 7,71" (code + percent + amount on same line)
TVA_PATTERNS = [
# Primary: "TVA A 21,00% 7.71" with various spacing
r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
# With backslash OCR artifact: "TVA A \21,00% 7.71"
r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
# IVA variant (rare OCR misread)
r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract Lidl-specific TVA entries.
Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts.
Uses deduplication to avoid counting the same entry twice from different patterns.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set() # Deduplication key: (code, percent)
for pattern in self.TVA_PATTERNS:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return Lidl-specific validation hints."""
return {
"has_multi_rate_tva": True,
"card_equals_total": True,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,99 @@
"""
OMV Petrom store profile for OCR extraction.
OMV receipts typically include client CUI and use standard TVA format.
Common at gas stations with fuel purchases.
Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14")
"""
import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any, Tuple, Optional
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class OMVProfile(BaseStoreProfile):
"""
OMV PETROM MARKETING S.R.L. - standard TVA with client CUI.
Key characteristics:
- Standard TVA format (usually single rate, any percentage)
- Includes client CUI on receipt (for business purchases)
- TVA table format: "A-XX,XX% base_amount tva_amount"
- Supports historical rates (19%) and current rates (21%)
- Date format: YYYY. MM. DD (with spaces)
"""
CUI_LIST = ["11201891"]
NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"] # OCR variants
STORE_NAME = "OMV PETROM MARKETING S.R.L."
# OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva)
TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'
# Standard TVA pattern fallback
TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)'
# OMV specific: prioritize YYYY. MM. DD format with spaces
DATE_PATTERNS_OCR_SPACES = [
# YYYY. MM. DD with time (OMV format)
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
# Fallback to DD. MM. YYYY
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract OMV-specific TVA entries.
OMV receipts often show TVA in table format with base and TVA amounts.
Falls back to standard extraction if table format not found.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try table format first (more accurate)
for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
# TVA amount is the second number (smaller one)
tva_amount = self._parse_decimal(match.group(4))
if tva_amount and tva_amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': tva_amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return OMV-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False,
"has_client_cui": True,
"has_efactura": False,
"is_non_vat_payer": False,
"tva_table_format": True,
}

View File

@@ -0,0 +1,101 @@
"""
PICTUS VELUM SRL store profile for OCR extraction.
Office supplies and stationery store.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class PictusVelumProfile(BaseStoreProfile):
"""
PICTUS VELUM SRL - standard TVA profile.
Key characteristics:
- Standard TVA format (single rate, any percentage)
- Office supplies and stationery (rechizite)
- CARD payment typical
"""
CUI_LIST = ["39634534"]
NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"]
STORE_NAME = "PICTUS VELUM SRL"
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA XX% YY,YY" (simple format without code)
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return PICTUS VELUM-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": False,
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,111 @@
"""
SOCAR Petroleum store profile for OCR extraction.
SOCAR receipts are similar to OMV - gas station with client CUI support.
Date format may use YYYY. MM. DD with spaces.
"""
import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any, Tuple, Optional
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class SocarProfile(BaseStoreProfile):
"""
SOCAR PETROLEUM S.A. - standard TVA with client CUI.
Key characteristics:
- Standard TVA format (usually single rate)
- Includes client CUI on receipt (for business purchases)
- Similar format to OMV/Petrom
- Date format may use YYYY. MM. DD (with spaces)
"""
CUI_LIST = ["12546600"]
NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"] # OCR variants
STORE_NAME = "SOCAR PETROLEUM S.A."
# Standard TVA patterns for gas stations
TVA_PATTERNS = [
# Table format: "A-19,00% 285,66 49,58"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)',
# Simple format: "TVA 19% 49,58"
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]
# Gas stations may use YYYY. MM. DD format
DATE_PATTERNS_OCR_SPACES = [
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract SOCAR-specific TVA entries.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try table format first
table_pattern = self.TVA_PATTERNS[0]
for match in re.finditer(table_pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
tva_amount = self._parse_decimal(match.group(4))
if tva_amount and tva_amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': tva_amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation):
continue
# Fallback to simple format if no table entries found
if not entries:
simple_pattern = self.TVA_PATTERNS[1]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
# Default to code 'A' for simple format
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break # Only take first match for simple format
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return SOCAR-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False,
"has_client_cui": True,
"has_efactura": False,
"is_non_vat_payer": False,
}

View File

@@ -0,0 +1,112 @@
"""
STEPOUT MARKET SRL store profile for OCR extraction.
Bookstore with reduced TVA rate (5% for books in Romania).
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class StepoutMarketProfile(BaseStoreProfile):
"""
STEPOUT MARKET SRL - reduced TVA rate profile (books).
Key characteristics:
- Reduced TVA rate: 5% for books (cărți qualification in Romania)
- May also have standard rates for non-book items
- Patterns are flexible to accept ANY TVA rate
- CARD payment typical
"""
CUI_LIST = ["35532655"]
NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"]
STORE_NAME = "STEPOUT MARKET SRL"
# TVA patterns (flexible - accepts any rate including 5%)
TVA_PATTERNS = [
# "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format)
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - 5,00% = YY,YY" (table format)
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA 5% YY,YY" (simple format - common for single rate)
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
# "TVA 5,00%: YY,YY" (percent with colon)
r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Stepout Market primarily sells books which have 5% TVA in Romania.
The patterns are generic and will extract whatever rate is on the receipt.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first (have code letter)
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format (no code letter, just percent + amount)
if not entries:
for pattern in self.TVA_PATTERNS[2:]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
# Default to code 'A' for simple format
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break # Only take first match for simple format
except (ValueError, InvalidOperation):
continue
if entries:
break
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return STEPOUT MARKET-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": True,
"has_client_cui": True, # May have client CUI
"has_efactura": False,
"is_non_vat_payer": False,
"typical_tva_rate": 5, # Books have 5% TVA in Romania
"product_category": "books",
}

View File

@@ -0,0 +1,103 @@
"""
UNLIMITED KEYS S.R.L. store profile for OCR extraction.
Key duplication service. Notable for CASH (NUMERAR) payments.
"""
import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any
from .base import BaseStoreProfile
from . import ProfileRegistry
@ProfileRegistry.register
class UnlimitedKeysProfile(BaseStoreProfile):
"""
UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment.
Key characteristics:
- Standard TVA format (single rate, any percentage)
- Key duplication service
- NUMERAR (cash) payment common - different from most stores!
- May also accept CARD
"""
CUI_LIST = ["18993187"]
NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"]
STORE_NAME = "UNLIMITED KEYS S.R.L."
# Standard TVA patterns (flexible - accepts any rate)
TVA_PATTERNS = [
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
# "A - XX,XX% = YY,YY"
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
# "TVA XX% YY,YY" (simple format without code)
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]
def extract_tva_entries(self, text: str) -> List[dict]:
"""
Extract TVA entries from receipt text.
Args:
text: Raw OCR text from receipt
Returns:
List of TVA entries with code, percent, and amount
"""
entries = []
seen = set()
# Try coded patterns first
for pattern in self.TVA_PATTERNS[:2]:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount = self._parse_decimal(match.group(3))
if amount and amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
except (ValueError, InvalidOperation, IndexError):
continue
# Fallback to simple format
if not entries:
simple_pattern = self.TVA_PATTERNS[2]
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
try:
percent = int(match.group(1))
amount = self._parse_decimal(match.group(2))
if amount and amount > 0:
entries.append({
'code': 'A',
'percent': percent,
'amount': amount
})
break
except (ValueError, InvalidOperation):
continue
return entries
def get_validation_hints(self) -> Dict[str, Any]:
"""Return UNLIMITED KEYS-specific validation hints."""
return {
"has_multi_rate_tva": False,
"card_equals_total": False, # May be NUMERAR (cash)
"has_client_cui": True, # May have client CUI
"has_efactura": False,
"is_non_vat_payer": False,
"common_payment": "NUMERAR", # Cash payments common
}

View File

@@ -7,6 +7,7 @@ from typing import Optional, Tuple, List
from dataclasses import dataclass, field
from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
@dataclass
@@ -63,6 +64,57 @@ class ExtractionResult:
class ReceiptExtractor:
"""Extract receipt fields using pattern matching for Romanian receipts."""
# =========================================================================
# DEPRECATED: STORE_PROFILES dict - USE ProfileRegistry INSTEAD
# =========================================================================
# Store profiles are now managed by ProfileRegistry in:
# backend/modules/data_entry/services/ocr/profiles/
#
# This dict is kept for reference only. All extraction logic now uses:
# ProfileRegistry.get_profile(cui)
#
# See: backend/modules/data_entry/services/ocr/profiles/README.md
# =========================================================================
STORE_PROFILES = {
# Lidl - multi-rate TVA (A+B), specific format without hyphen/colon
"22891860": {
"name": "LIDL DISCOUNT S.R.L.",
"tva_pattern": "lidl",
"tva_format": "TVA {code} {percent}% {amount}",
"has_multi_rate_tva": True,
"card_equals_total": True,
},
# OMV Petrom - single TVA rate, client CUI included
"11201891": {
"name": "OMV PETROM MARKETING S.R.L.",
"tva_pattern": "standard",
"has_client_cui": True,
},
# FIVE-HOLDING (BRICK) - standard format
"10562600": {
"name": "FIVE-HOLDING S.A.",
"tva_pattern": "standard",
},
# Dedeman - e-factura format
"2816464": {
"name": "DEDEMAN SRL",
"tva_pattern": "standard",
"has_efactura": True,
},
# SOCAR Petroleum
"12546600": {
"name": "SOCAR PETROLEUM S.A.",
"tva_pattern": "standard",
"has_client_cui": True,
},
# Kineterra - non-VAT payer
"31180432": {
"name": "KINETERRA CONCEPT SRL",
"tva_pattern": "none",
"is_non_vat_payer": True,
},
}
# Total amount patterns (most specific first)
# Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc.
# OCR often produces errors, so patterns must be tolerant
@@ -394,48 +446,101 @@ class ReceiptExtractor:
result.raw_text = text
text_upper = text.upper()
# Extract core fields
result.amount, result.confidence_amount = self._extract_amount(text_upper)
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
result.receipt_number, _ = self._extract_number(text_upper)
result.receipt_series, _ = self._extract_series(text_upper)
# =========================================================================
# STEP 1: Extract vendor info FIRST to find store profile
# =========================================================================
result.partner_name, result.confidence_vendor = self._extract_vendor(text)
result.cui, _ = self._extract_cui(text_upper, text)
# Normalize CUI: fix R0 → RO OCR error and validate format
result.cui = OCRValidationEngine.normalize_cui(result.cui)
# Extract additional fields - Multiple TVA entries
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
# Lookup store-specific profile for enhanced extraction accuracy
store_profile = ProfileRegistry.get_profile(result.cui) if result.cui else None
if store_profile:
print(f"[Profile] Using {store_profile.__class__.__name__} for CUI {result.cui}", flush=True)
# =========================================================================
# STEP 2: Extract ALL fields using profile (if available) or generic
# =========================================================================
if store_profile:
# Profile-specific extraction (higher accuracy for known stores)
result.amount, result.confidence_amount = store_profile.extract_total(text_upper)
result.receipt_date, result.confidence_date = store_profile.extract_date(text_upper)
result.receipt_number, _ = store_profile.extract_receipt_number(text_upper)
result.tva_entries = store_profile.extract_tva_entries(text_upper)
result.tva_total = sum(e['amount'] for e in result.tva_entries) if result.tva_entries else None
result.payment_methods = store_profile.extract_payment_methods(text_upper)
# Client data extraction via profile (CUI + name)
profile_client_cui, cui_confidence = store_profile.extract_client_cui(text_upper)
profile_client_name, name_confidence = store_profile.extract_client_name(text)
if profile_client_cui or profile_client_name:
# Use profile extraction results
result.client_cui = OCRValidationEngine.normalize_cui(profile_client_cui) if profile_client_cui else None
result.client_name = profile_client_name
result.confidence_client = max(cui_confidence, name_confidence)
# Address still via generic (no profile method)
_, _, client_address, _ = self._extract_client_data(text_upper, text)
result.client_address = client_address
else:
# Fallback to generic client extraction
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
result.client_name = client_name
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
result.client_address = client_address
result.confidence_client = confidence
print(f"[Profile] Extracted: total={result.amount}, date={result.receipt_date}, "
f"TVA entries={len(result.tva_entries)}, payments={len(result.payment_methods)}", flush=True)
else:
# Generic extraction for unknown stores
result.amount, result.confidence_amount = self._extract_amount(text_upper)
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
result.receipt_number, _ = self._extract_number(text_upper)
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
result.payment_methods = self._extract_payment_methods(text_upper)
# Generic client extraction
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
result.client_name = client_name
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
result.client_address = client_address
result.confidence_client = confidence
# Series extraction (no profile method, always generic)
result.receipt_series, _ = self._extract_series(text_upper)
# =========================================================================
# STEP 3: Debug logging and validation
# =========================================================================
if not result.tva_entries:
print(f"[TVA Debug] No TVA found. Checking patterns...", flush=True)
# Debug: show what patterns see
normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper)
taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE)
rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE)
print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True)
# Log TVA vs TOTAL for debugging (validation happens in ocr_service._final_validation)
# NOTE: We NO LONGER clear TVA here - the service will recalculate TOTAL from TVA if needed
# Log TVA vs TOTAL for debugging
if result.tva_total and result.amount:
if result.tva_total > result.amount:
print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True)
elif result.tva_total > result.amount * Decimal('0.5'):
print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True)
# Additional generic extractions
result.items_count = self._extract_items_count(text_upper)
result.address = self._extract_address(text_upper)
result.payment_methods = self._extract_payment_methods(text_upper)
# Validate payment methods against extracted amount
# If payment sum >> amount, clear invalid payments (likely OCR error)
# =========================================================================
# STEP 4: Validate and post-process
# =========================================================================
# Save original payment methods before validation (for payment mode detection)
original_payment_methods = result.payment_methods.copy() if result.payment_methods else []
# Validate payment methods against extracted amount
result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount)
# Auto-suggest payment_mode based on detected payment methods
# Use ORIGINAL payment_methods to detect CARD even if validation cleared them
# (e.g., CARD 318.16 is valid even if total validation failed)
payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods
if payment_methods_for_mode:
card_amount = sum(
@@ -447,17 +552,9 @@ class ReceiptExtractor:
result.suggested_payment_mode = 'banca'
print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True)
else:
# Only cash payments detected
result.suggested_payment_mode = 'numerar'
print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True)
# Extract client data (B2B receipts)
client_name, client_cui, client_address, confidence_client = self._extract_client_data(text_upper, text)
result.client_name = client_name
result.client_cui = OCRValidationEngine.normalize_cui(client_cui) # Fix R0 → RO OCR error
result.client_address = client_address
result.confidence_client = confidence_client
# Detect receipt type
result.receipt_type = self._detect_receipt_type(text_upper)
@@ -620,6 +717,40 @@ class ReceiptExtractor:
return num_str
def _calculate_multi_rate_tva_total(self, tva_entries: List[dict]) -> Optional[Decimal]:
"""
Calculate implied total from ALL TVA entries (multi-rate support).
Formula for each entry: total_for_entry = tva * (100 + rate) / rate
Final total = sum of all entry totals
Example for Lidl (TVA A 21% = 7.71, TVA B 11% = 2.13):
Entry A: 7.71 * 121 / 21 = 44.45
Entry B: 2.13 * 111 / 11 = 21.49
Total: 44.45 + 21.49 = 65.94 ≈ 65.86 (within tolerance)
Returns:
Implied total Decimal, or None if calculation not possible
"""
if not tva_entries:
return None
total = Decimal('0')
for entry in tva_entries:
rate = entry.get('percent', 0)
tva_amount = entry.get('amount')
if tva_amount and rate > 0:
try:
tva_dec = Decimal(str(tva_amount))
# Formula: total_for_entry = tva * (100 + rate) / rate
entry_total = tva_dec * Decimal(100 + rate) / Decimal(rate)
total += entry_total
print(f"[Multi-rate TVA] Entry {entry.get('code', '?')}: tva={tva_amount}, rate={rate}% -> implied={entry_total:.2f}", flush=True)
except (InvalidOperation, ValueError, TypeError):
continue
return total.quantize(Decimal('0.01')) if total > 0 else None
def _cross_validate_and_calculate_amount(
self,
amount: Optional[Decimal],
@@ -634,12 +765,11 @@ class ReceiptExtractor:
Returns: (amount, confidence, source_description)
Logic:
1. If amount is valid (>0) with high confidence (>=0.8), use it directly
2. Calculate payment_sum = CARD + NUMERAR + other methods
3. Calculate tva_implied_total = tva_total * (100 + rate) / rate
4. Cross-validate: if payment_sum matches extracted amount, boost confidence
5. If amount is 0/None, use payment_sum as total
6. If payment_sum is 0, try to calculate from TVA
1. Collect all available sources: extracted amount, payment sum, TVA-implied total
2. Find consensus: 2+ sources within 3% tolerance
3. If consensus found, use the higher-confidence source value
4. If extracted differs >10% from all others, it's an outlier - correct it
5. If no consensus possible, fallback to individual validations
"""
# Calculate payment methods sum
payment_sum = Decimal('0')
@@ -652,43 +782,73 @@ class ReceiptExtractor:
except (InvalidOperation, ValueError, TypeError):
continue
# Calculate TVA-implied total: total = tva * (100 + rate) / rate
tva_implied_total = None
if tva_entries:
# Use the main TVA entry (typically the largest or first one)
main_entry = tva_entries[0]
rate = main_entry.get('percent', 19)
tva_amount = main_entry.get('amount')
if tva_amount and rate > 0:
try:
tva_dec = Decimal(str(tva_amount))
# total = tva * (100 + rate) / rate
tva_implied_total = (tva_dec * Decimal(100 + rate) / Decimal(rate)).quantize(Decimal('0.01'))
except (InvalidOperation, ValueError, TypeError):
pass
# Calculate TVA-implied total using ALL entries (multi-rate fix)
tva_implied_total = self._calculate_multi_rate_tva_total(tva_entries)
# Case 1: Amount is valid with high confidence - validate against TVA and payments
# Multi-source consensus approach (3% tolerance for multi-rate TVA rounding)
CONSENSUS_TOLERANCE = 3.0 # 3% tolerance
# Collect all available sources with their confidences
sources = []
if amount and amount > 0:
sources.append(('extracted', float(amount), confidence_amount))
if payment_sum > 0:
sources.append(('payment', float(payment_sum), 0.92)) # Payment is very reliable
if tva_implied_total and tva_implied_total > 0:
sources.append(('tva_calc', float(tva_implied_total), 0.88)) # TVA calc is reliable
print(f"[Cross-Validation] Sources: {[(s[0], f'{s[1]:.2f}', f'{s[2]:.2f}') for s in sources]}", flush=True)
# Find consensus: 2+ sources within tolerance
if len(sources) >= 2:
for i, (name1, val1, conf1) in enumerate(sources):
for name2, val2, conf2 in sources[i+1:]:
if val1 <= 0 or val2 <= 0:
continue
diff_pct = abs(val1 - val2) / max(val1, val2) * 100
if diff_pct <= CONSENSUS_TOLERANCE:
# Consensus found! Use value from higher-confidence source
if conf1 >= conf2:
consensus_val, consensus_conf = val1, conf1
else:
consensus_val, consensus_conf = val2, conf2
# Boost confidence for consensus
consensus_conf = min(0.98, consensus_conf + 0.05)
print(f"[Cross-Validation] Consensus: {name1}={val1:.2f}{name2}={val2:.2f} (diff={diff_pct:.1f}%)", flush=True)
return Decimal(str(round(consensus_val, 2))), consensus_conf, f"consensus ({name1}+{name2})"
# No consensus - check if extracted is an outlier (differs >10% from all others)
if amount and amount > 0 and len(sources) >= 2:
other_sources = [s for s in sources if s[0] != 'extracted']
if other_sources:
extracted_val = float(amount)
all_differ = all(
abs(extracted_val - s[1]) / max(extracted_val, s[1]) * 100 > 10
for s in other_sources if s[1] > 0
)
if all_differ:
# Extracted differs significantly from all others - use the best other source
best_other = max(other_sources, key=lambda s: s[2])
print(f"[Cross-Validation] Extracted outlier: {extracted_val:.2f} differs >10% from all others, using {best_other[0]}={best_other[1]:.2f}", flush=True)
return Decimal(str(round(best_other[1], 2))), best_other[2], f"corrected (extracted outlier, using {best_other[0]})"
# Fallback: Case 1 - Amount valid with high confidence
if amount and amount > 0 and confidence_amount >= 0.8:
# First check TVA-implied total (most reliable when TVA is extracted correctly)
# Check TVA-implied total
if tva_implied_total and tva_implied_total > 0:
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
if tva_diff_percent <= 1:
# Near-perfect TVA match - highest confidence
if tva_diff_percent <= 3:
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)"
elif tva_diff_percent > 10:
# Significant mismatch - TVA-implied total is more reliable
# This catches cases where wrong TOTAL line was extracted (e.g., REST, SUBTOTAL)
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)"
# Cross-validate with payment methods
if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'):
# Perfect match - boost confidence
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)"
elif payment_sum > 0:
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
if payment_diff_percent > 10:
# Significant mismatch - payment sum is more reliable
print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True)
return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)"
@@ -696,29 +856,22 @@ class ReceiptExtractor:
# Case 2: Amount exists but low confidence - try to validate/correct
if amount and amount > 0:
# First check TVA-implied total (most reliable)
if tva_implied_total and tva_implied_total > 0:
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
if tva_diff_percent <= 2:
# Close match - boost confidence
if tva_diff_percent <= 3:
return amount, 0.88, "extracted (validated by TVA)"
elif tva_diff_percent > 10:
# Significant mismatch - use TVA-implied total
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
return tva_implied_total, 0.85, "calculated from TVA"
# Check if payment methods sum matches
if payment_sum > 0:
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
if payment_diff_percent <= 0.5:
# Close match - boost confidence
if payment_diff_percent <= 1:
return amount, 0.90, "extracted (validated by payment methods)"
elif payment_diff_percent > 10:
# Mismatch - prefer payment_sum as it's more reliable
print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True)
return payment_sum, 0.85, "calculated from payment methods"
# No validation possible - return as-is
return amount, confidence_amount, "extracted (unvalidated)"
# Case 3: Amount is 0 or None - calculate from payment methods
@@ -946,6 +1099,28 @@ class ReceiptExtractor:
return name
def _get_store_profile(self, cui: Optional[str]) -> Optional[dict]:
"""
Get store-specific profile by CUI.
DEPRECATED: Use ProfileRegistry.get_profile() directly for profile objects.
This method is kept for backward compatibility and returns validation hints dict.
Args:
cui: The CUI extracted from receipt (with or without RO prefix)
Returns:
Store profile validation hints dict or None if not found
"""
profile = ProfileRegistry.get_profile(cui)
if profile:
# Return validation hints for backward compatibility
hints = profile.get_validation_hints()
hints['name'] = profile.STORE_NAME
print(f"[Store Profile] Found profile for {cui}: {profile.STORE_NAME}", flush=True)
return hints
return None
def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]:
"""
Extract vendor CUI (fiscal identification code) from text.
@@ -1020,11 +1195,114 @@ class ReceiptExtractor:
# Default to bon_fiscal if neither found
return 'bon_fiscal'
def _try_pattern_lidl(self, text: str) -> List[dict]:
"""
Try Lidl-style TVA pattern: "TVA A 21,00% 7.71" (no hyphen/colon separator).
Lidl receipts format:
TOTAL TVA 9,84
TVA A 21,00% 7,71
TVA B 11,00% 2,13
Returns list of TVA entries found.
"""
entries = []
seen = set()
# Pattern: TVA/TUA/IVA + code (A-D) + percent + amount (on same line)
# Handles: "TVA A 21,00% 7,71", "TVA B 11,00% 2,13", "TUA A 21% 7.71"
lidl_patterns = [
# Same line: "TVA A 21,00% 7.71" (with various spacing)
r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
# Same line with backslash (OCR artifact): "TVA A \21,00% 7.71"
r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
# IVA variant
r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
]
for pattern in lidl_patterns:
for match in re.finditer(pattern, text, re.IGNORECASE):
try:
code = match.group(1).upper()
percent = int(match.group(2))
amount_str = self._normalize_number(match.group(3))
amount = Decimal(amount_str)
if amount > 0:
entry_key = (code, percent)
if entry_key not in seen:
entries.append({
'code': code,
'percent': percent,
'amount': amount
})
seen.add(entry_key)
print(f"[TVA Lidl] Found: TVA {code} {percent}% = {amount}", flush=True)
except (ValueError, InvalidOperation):
continue
return entries
def _select_best_tva_candidate(
self,
candidates: List[tuple],
tva_bon_total: Optional[Decimal]
) -> Tuple[List[dict], Optional[Decimal]]:
"""
Select the best TVA candidate from collected candidates.
Selection criteria (priority order):
1. Sum matches TOTAL TVA BON (highest priority)
2. More entries = better (for multi-rate receipts)
3. Pattern confidence as tiebreaker
Args:
candidates: List of (pattern_name, confidence, entries, sum)
tva_bon_total: Authoritative TOTAL TVA BON value (if extracted)
Returns:
(best_entries, best_sum)
"""
if not candidates:
return [], None
# Score each candidate
scored = []
for name, confidence, entries, sum_val in candidates:
score = 0.0
# Criterion 1: Sum matches TOTAL TVA BON (highest priority)
if tva_bon_total and sum_val:
tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02')) # 2% tolerance
if abs(sum_val - tva_bon_total) <= tolerance:
score += 100 # High bonus for matching authoritative total
print(f"[TVA Select] {name}: sum {sum_val} matches tva_bon_total {tva_bon_total}", flush=True)
# Criterion 2: More entries (for multi-rate receipts)
score += len(entries) * 10
# Criterion 3: Pattern confidence
score += confidence * 5
scored.append((score, name, confidence, entries, sum_val))
print(f"[TVA Select] Candidate {name}: score={score:.1f}, entries={len(entries)}, sum={sum_val}", flush=True)
# Sort by score descending
scored.sort(key=lambda x: x[0], reverse=True)
best = scored[0]
print(f"[TVA Select] Winner: {best[1]} (score={best[0]:.1f})", flush=True)
return best[3], best[4]
def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]:
"""
Extract multiple TVA (VAT) entries from text.
Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%).
Uses CANDIDATE COLLECTION approach:
- Try ALL patterns and collect candidates
- Select best candidate based on matching TOTAL TVA BON
Returns (tva_entries, tva_total) where tva_entries is a list of:
{'code': 'A', 'percent': 19, 'amount': Decimal('15.20')}
"""
@@ -1054,6 +1332,22 @@ class ReceiptExtractor:
# Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%")
normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text)
# Extract TOTAL TVA BON/TOTAL TVA first as the authoritative reference
tva_bon_total = self._extract_total_tva_bon(normalized_text)
print(f"[TVA Debug] TOTAL TVA BON: {tva_bon_total}", flush=True)
# CANDIDATE COLLECTION APPROACH: Try all patterns, collect candidates, select best
all_candidates = [] # List of (pattern_name, confidence, entries, sum)
# === LIDL-STYLE PATTERNS (NEW) ===
# Lidl format: "TVA A 21,00% 7.71" or "TVA B 11,00% 2.13" (no hyphen/colon)
# This pattern handles multi-rate TVA receipts
lidl_entries = self._try_pattern_lidl(normalized_text)
if lidl_entries:
lidl_sum = sum(e['amount'] for e in lidl_entries)
all_candidates.append(('lidl', 0.96, lidl_entries, lidl_sum))
print(f"[TVA Debug] Lidl pattern: {len(lidl_entries)} entries, sum={lidl_sum}", flush=True)
# Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable
# Format: "TOTAL TAXE: 55,22" - this is always the TVA amount
# OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:"
@@ -1372,10 +1666,21 @@ class ReceiptExtractor:
except (ValueError, InvalidOperation):
continue
# Extract TOTAL TVA BON as reference (separate from individual entries)
tva_bon_total = self._extract_total_tva_bon(normalized_text)
# Add existing extraction results to candidates (if any)
if tva_entries:
entries_sum = sum(entry['amount'] for entry in tva_entries)
all_candidates.append(('standard', 0.90, tva_entries, entries_sum))
print(f"[TVA Debug] Standard patterns: {len(tva_entries)} entries, sum={entries_sum}", flush=True)
# Calculate sum from entries
# === CANDIDATE SELECTION ===
# Select best candidate using TOTAL TVA BON as authoritative reference
if all_candidates:
best_entries, best_sum = self._select_best_tva_candidate(all_candidates, tva_bon_total)
if best_entries:
tva_entries = best_entries
entries_sum = best_sum
# Calculate sum from entries (if not set by candidate selection)
entries_sum = None
if tva_entries:
entries_sum = sum(entry['amount'] for entry in tva_entries)

600
scripts/generate_store_profile.py Executable file
View File

@@ -0,0 +1,600 @@
#!/usr/bin/env python3
"""
Store Profile Generator Script
Analyzes PDF receipts from a store and generates a Python profile class
for the OCR extraction system.
Usage:
python scripts/generate_store_profile.py \
--name "Magazin Exemplu" \
--cui "12345678" \
--receipts "docs/data-entry/MagazinExemplu*.pdf" \
--output "backend/modules/data_entry/services/ocr/profiles/magazin_exemplu.py"
Features:
- Submits PDFs to OCR API
- Analyzes extracted text for patterns (TVA, total, date, payment)
- Generates a BaseStoreProfile subclass with detected patterns
- Supports hot-reload via ProfileRegistry
Requirements:
- Backend server running on localhost:8000
- JWT authentication
- python-jose, requests packages
"""
import argparse
import glob
import json
import os
import re
import sys
import time
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Dict, List, Optional, Tuple
try:
import requests
from jose import jwt
except ImportError:
print("Error: Required packages not installed.")
print("Run: pip install python-jose requests")
sys.exit(1)
# Configuration
API_BASE = os.getenv("API_BASE", "http://localhost:8000")
JWT_SECRET = os.getenv("JWT_SECRET_KEY", "GENERATE_NEW_SECRET_FOR_PRODUCTION3334!")
def create_jwt_token() -> str:
"""Create a test JWT token for API authentication."""
payload = {
"username": "PROFILE_GENERATOR",
"user_id": 1,
"companies": ["604"],
"permissions": ["read", "write"],
"exp": datetime.now(timezone.utc) + timedelta(hours=1),
"iat": datetime.now(timezone.utc),
"type": "access"
}
return jwt.encode(payload, JWT_SECRET, algorithm="HS256")
def submit_ocr(pdf_path: str, token: str, api_base: str = API_BASE, timeout: int = 120) -> Optional[Dict]:
"""
Submit a PDF to OCR API and wait for result.
Args:
pdf_path: Path to PDF file
token: JWT authentication token
api_base: API base URL
timeout: Max seconds to wait for completion
Returns:
Extraction result dict or None on failure
"""
headers = {"Authorization": f"Bearer {token}"}
filename = os.path.basename(pdf_path)
print(f" Submitting: {filename}...", end=" ", flush=True)
try:
with open(pdf_path, "rb") as f:
files = {"file": (filename, f, "application/pdf")}
response = requests.post(
f"{api_base}/api/data-entry/ocr/extract?engine=doctr_plus",
files=files,
headers=headers,
timeout=30
)
if response.status_code != 200:
print(f"FAILED (HTTP {response.status_code})")
return None
job_data = response.json()
job_id = job_data.get("job_id")
if not job_id:
print("FAILED (no job_id)")
return None
# Poll for completion
start_time = time.time()
while time.time() - start_time < timeout:
poll_response = requests.get(
f"{api_base}/api/data-entry/ocr/jobs/{job_id}/wait?timeout=30",
headers=headers,
timeout=35
)
if poll_response.status_code == 200:
job_result = poll_response.json()
status = job_result.get("status")
if status == "completed":
elapsed = time.time() - start_time
print(f"OK ({elapsed:.1f}s)")
return job_result.get("result", {})
elif status == "error":
print(f"ERROR: {job_result.get('error', 'Unknown')}")
return None
time.sleep(2)
print("TIMEOUT")
return None
except Exception as e:
print(f"EXCEPTION: {e}")
return None
def analyze_tva_patterns(results: List[Dict]) -> Dict:
"""
Analyze TVA patterns from multiple extraction results.
Returns:
Dict with detected patterns and statistics
"""
tva_entries = []
raw_texts = []
for r in results:
if r.get("tva_entries"):
tva_entries.extend(r["tva_entries"])
if r.get("raw_text"):
raw_texts.append(r["raw_text"])
# Analyze TVA code patterns (A, B, C, etc.)
codes = Counter(e.get("code") for e in tva_entries if e.get("code"))
# Analyze TVA percentage patterns
percents = Counter(e.get("percent") for e in tva_entries if e.get("percent"))
# Detect TVA format from raw text
tva_formats = defaultdict(int)
for text in raw_texts:
text_upper = text.upper()
# Standard format: "TVA 19% 10.50" or "TVA: 19% 10.50"
if re.search(r'TVA\s*:?\s*\d{1,2}%', text_upper):
tva_formats["standard"] += 1
# Lidl format: "TVA A 21% 7.71"
if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
tva_formats["lidl_multi_rate"] += 1
# Table format: "BAZA TVA | % TVA | VALOARE TVA"
if re.search(r'BAZA\s+TVA', text_upper):
tva_formats["table"] += 1
# No TVA (neplatitor)
if re.search(r'NEPLATITOR|NON.?TVA', text_upper):
tva_formats["non_vat"] += 1
return {
"codes": dict(codes),
"percents": dict(percents),
"formats": dict(tva_formats),
"has_multi_rate": len(codes) > 1,
"is_non_vat": tva_formats.get("non_vat", 0) > 0,
"dominant_format": max(tva_formats, key=tva_formats.get) if tva_formats else "standard"
}
def analyze_total_patterns(results: List[Dict]) -> Dict:
"""Analyze TOTAL patterns from extraction results."""
totals = []
raw_texts = []
for r in results:
if r.get("amount"):
totals.append(float(r["amount"]))
if r.get("raw_text"):
raw_texts.append(r["raw_text"])
total_formats = defaultdict(int)
for text in raw_texts:
text_upper = text.upper()
if re.search(r'TOTAL\s*:?\s*[\d.,]+', text_upper):
total_formats["TOTAL:"] += 1
if re.search(r'TOTAL\s+DE\s+PLAT', text_upper):
total_formats["TOTAL DE PLATA"] += 1
if re.search(r'SUMA\s+TOTAL', text_upper):
total_formats["SUMA TOTALA"] += 1
if re.search(r'GRAND\s*TOTAL', text_upper):
total_formats["GRAND TOTAL"] += 1
return {
"count": len(totals),
"formats": dict(total_formats),
"dominant_format": max(total_formats, key=total_formats.get) if total_formats else "TOTAL"
}
def analyze_date_patterns(results: List[Dict]) -> Dict:
"""Analyze date patterns from extraction results."""
dates = []
raw_texts = []
for r in results:
if r.get("receipt_date"):
dates.append(r["receipt_date"])
if r.get("raw_text"):
raw_texts.append(r["raw_text"])
date_formats = defaultdict(int)
for text in raw_texts:
# DD.MM.YYYY
if re.search(r'\d{2}\.\d{2}\.\d{4}', text):
date_formats["DD.MM.YYYY"] += 1
# YYYY.MM.DD (OMV/SOCAR style)
if re.search(r'\d{4}\.\d{2}\.\d{2}', text):
date_formats["YYYY.MM.DD"] += 1
# DD-MM-YYYY
if re.search(r'\d{2}-\d{2}-\d{4}', text):
date_formats["DD-MM-YYYY"] += 1
# DD/MM/YYYY
if re.search(r'\d{2}/\d{2}/\d{4}', text):
date_formats["DD/MM/YYYY"] += 1
return {
"extracted_dates": dates,
"formats": dict(date_formats),
"dominant_format": max(date_formats, key=date_formats.get) if date_formats else "DD.MM.YYYY"
}
def analyze_payment_patterns(results: List[Dict]) -> Dict:
"""Analyze payment method patterns."""
payment_counts = defaultdict(int)
for r in results:
methods = r.get("payment_methods", [])
for m in methods:
method_type = m.get("method", "UNKNOWN")
payment_counts[method_type] += 1
return {
"methods": dict(payment_counts),
"has_mixed_payments": len(payment_counts) > 1
}
def analyze_client_patterns(results: List[Dict]) -> Dict:
"""Analyze client (B2B) patterns."""
has_client_cui = 0
has_client_name = 0
for r in results:
if r.get("client_cui"):
has_client_cui += 1
if r.get("client_name"):
has_client_name += 1
return {
"has_client_cui": has_client_cui > 0,
"has_client_name": has_client_name > 0,
"b2b_ratio": has_client_cui / len(results) if results else 0
}
def generate_profile_code(
store_name: str,
cui: str,
tva_analysis: Dict,
total_analysis: Dict,
date_analysis: Dict,
payment_analysis: Dict,
client_analysis: Dict
) -> str:
"""
Generate Python profile class code.
Args:
store_name: Human-readable store name
cui: CUI number (without RO prefix)
*_analysis: Analysis results from pattern detection
Returns:
Python source code for the profile class
"""
# Generate class name from store name
class_name = "".join(
word.capitalize()
for word in re.sub(r'[^a-zA-Z0-9\s]', '', store_name).split()
) + "Profile"
# Generate module name
module_name = re.sub(r'[^a-z0-9]', '_', store_name.lower()).strip('_')
# Determine profile characteristics
is_non_vat = tva_analysis.get("is_non_vat", False)
has_multi_rate = tva_analysis.get("has_multi_rate", False)
has_client_cui = client_analysis.get("has_client_cui", False)
uses_yyyy_mm_dd = date_analysis.get("dominant_format") == "YYYY.MM.DD"
# Generate OCR name patterns
name_words = store_name.upper().split()
primary_word = name_words[0] if name_words else store_name.upper()
name_patterns = [
primary_word,
store_name.upper().replace(".", "").replace(",", ""),
]
# Add OCR error variants
ocr_variants = {
'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3'
}
for char, replacement in ocr_variants.items():
if char in primary_word:
name_patterns.append(primary_word.replace(char, replacement, 1))
name_patterns = list(dict.fromkeys(name_patterns))[:4] # Unique, max 4
# Build the code
code_lines = [
'"""',
f'{store_name} store profile for OCR extraction.',
'',
'Auto-generated by generate_store_profile.py',
f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}',
'"""',
'',
'import re',
'from decimal import Decimal, InvalidOperation',
'from typing import List, Dict, Any',
'',
'from .base import BaseStoreProfile',
'from . import ProfileRegistry',
'',
'',
'@ProfileRegistry.register',
f'class {class_name}(BaseStoreProfile):',
' """',
f' {store_name} - OCR extraction profile.',
' ',
]
# Add characteristics to docstring
characteristics = []
if is_non_vat:
characteristics.append("Non-VAT payer (neplatitor TVA)")
if has_multi_rate:
characteristics.append("Multi-rate TVA")
if has_client_cui:
characteristics.append("B2B receipts with client CUI")
if uses_yyyy_mm_dd:
characteristics.append("Date format: YYYY.MM.DD")
if characteristics:
code_lines.append(' Key characteristics:')
for c in characteristics:
code_lines.append(f' - {c}')
code_lines.append(' ')
code_lines.extend([
' """',
'',
f' CUI_LIST = ["{cui}"]',
f' NAME_PATTERNS = {name_patterns}',
f' STORE_NAME = "{store_name}"',
'',
])
# Add date patterns override for YYYY.MM.DD format
if uses_yyyy_mm_dd:
code_lines.extend([
' # Override date patterns for YYYY.MM.DD format',
' DATE_PATTERNS_OCR_SPACES = [',
' r\'(\\d{4})[.,]\\s*(\\d{2})[.,]\\s*(\\d{2})\', # YYYY. MM. DD with spaces',
' r\'(\\d{4})[.,](\\d{2})[.,](\\d{2})\', # YYYY.MM.DD',
' ]',
'',
])
# Add TVA extraction method for multi-rate or non-VAT
if is_non_vat:
code_lines.extend([
' def extract_tva_entries(self, text: str) -> List[dict]:',
' """Non-VAT payer - returns empty list."""',
' return []',
'',
])
elif has_multi_rate and tva_analysis.get("dominant_format") == "lidl_multi_rate":
code_lines.extend([
' # Store-specific TVA patterns',
' TVA_PATTERNS = [',
' r\'T[VU][AR]\\s+([A-D])\\s+(\\d{1,2})[.,]?\\d{0,2}\\s*%\\s+([\\d.,]+)\',',
' ]',
'',
' def extract_tva_entries(self, text: str) -> List[dict]:',
' """Extract multi-rate TVA entries."""',
' entries = []',
' seen = set()',
'',
' for pattern in self.TVA_PATTERNS:',
' for match in re.finditer(pattern, text, re.IGNORECASE):',
' try:',
' code = match.group(1).upper()',
' percent = int(match.group(2))',
' amount = self._parse_decimal(match.group(3))',
'',
' if amount and amount > 0:',
' entry_key = (code, percent)',
' if entry_key not in seen:',
' entries.append({',
' \'code\': code,',
' \'percent\': percent,',
' \'amount\': amount',
' })',
' seen.add(entry_key)',
' except (ValueError, InvalidOperation):',
' continue',
'',
' return entries',
'',
])
# Add validation hints method
code_lines.extend([
' def get_validation_hints(self) -> Dict[str, Any]:',
f' """Return {store_name}-specific validation hints."""',
' return {',
f' "has_multi_rate_tva": {has_multi_rate},',
f' "card_equals_total": True,',
f' "has_client_cui": {has_client_cui},',
f' "has_efactura": False,',
f' "is_non_vat_payer": {is_non_vat},',
' }',
])
return '\n'.join(code_lines) + '\n'
def main():
parser = argparse.ArgumentParser(
description="Generate store profile from PDF receipts",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Generate profile from a single PDF
python scripts/generate_store_profile.py \\
--name "Magazin Nou" --cui "12345678" \\
--receipts "docs/data-entry/magazin_nou.pdf"
# Generate profile from multiple PDFs (glob pattern)
python scripts/generate_store_profile.py \\
--name "Carrefour" --cui "2475489" \\
--receipts "docs/data-entry/Carrefour*.pdf" \\
--output backend/modules/data_entry/services/ocr/profiles/carrefour.py
# Dry run (analyze only, don't write file)
python scripts/generate_store_profile.py \\
--name "Test Store" --cui "11111111" \\
--receipts "docs/data-entry/test*.pdf" \\
--dry-run
"""
)
parser.add_argument("--name", required=True, help="Store name (e.g., 'LIDL DISCOUNT S.R.L.')")
parser.add_argument("--cui", required=True, help="CUI number without RO prefix")
parser.add_argument("--receipts", required=True, help="PDF file path or glob pattern")
parser.add_argument("--output", help="Output file path (default: auto-generated)")
parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't write file")
parser.add_argument("--api-base", default=API_BASE, help=f"API base URL (default: {API_BASE})")
args = parser.parse_args()
# Update API base if provided
api_base = args.api_base
# Validate CUI format
cui = args.cui.strip().replace("RO", "").replace(" ", "")
if not cui.isdigit() or len(cui) < 6 or len(cui) > 10:
print(f"Error: Invalid CUI format: {args.cui}")
sys.exit(1)
# Find PDF files
pdf_files = glob.glob(args.receipts)
if not pdf_files:
print(f"Error: No PDF files found matching: {args.receipts}")
sys.exit(1)
print(f"\n{'='*60}")
print(f"Store Profile Generator")
print(f"{'='*60}")
print(f"Store: {args.name}")
print(f"CUI: {cui}")
print(f"PDFs: {len(pdf_files)} files")
print(f"{'='*60}\n")
# Generate JWT token
token = create_jwt_token()
# Submit PDFs to OCR
print("Step 1: Submitting PDFs to OCR API...")
results = []
for pdf_path in pdf_files:
result = submit_ocr(pdf_path, token, api_base=api_base)
if result:
results.append(result)
if not results:
print("\nError: No successful extractions. Check if backend is running.")
sys.exit(1)
print(f"\nSuccessfully extracted: {len(results)}/{len(pdf_files)} PDFs")
# Analyze patterns
print("\nStep 2: Analyzing patterns...")
tva_analysis = analyze_tva_patterns(results)
total_analysis = analyze_total_patterns(results)
date_analysis = analyze_date_patterns(results)
payment_analysis = analyze_payment_patterns(results)
client_analysis = analyze_client_patterns(results)
print(f" TVA: {tva_analysis['dominant_format']} format, multi-rate={tva_analysis['has_multi_rate']}")
print(f" Date: {date_analysis['dominant_format']} format")
print(f" Payments: {list(payment_analysis['methods'].keys())}")
print(f" B2B: {client_analysis['has_client_cui']}")
# Generate profile code
print("\nStep 3: Generating profile code...")
code = generate_profile_code(
store_name=args.name,
cui=cui,
tva_analysis=tva_analysis,
total_analysis=total_analysis,
date_analysis=date_analysis,
payment_analysis=payment_analysis,
client_analysis=client_analysis
)
# Determine output path
if args.output:
output_path = args.output
else:
module_name = re.sub(r'[^a-z0-9]', '_', args.name.lower()).strip('_')
output_path = f"backend/modules/data_entry/services/ocr/profiles/{module_name}.py"
if args.dry_run:
print(f"\n[DRY RUN] Would write to: {output_path}")
print(f"\n{'='*60}")
print("Generated code:")
print(f"{'='*60}")
print(code)
else:
# Write file
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, 'w') as f:
f.write(code)
print(f" Written to: {output_path}")
# Verify syntax
import py_compile
try:
py_compile.compile(output_path, doraise=True)
print(f" Syntax check: OK")
except py_compile.PyCompileError as e:
print(f" Syntax check: FAILED - {e}")
print(f"\n{'='*60}")
print("Profile generation complete!")
print(f"{'='*60}")
if not args.dry_run:
print(f"\nNext steps:")
print(f"1. Review the generated code: {output_path}")
print(f"2. Customize patterns if needed")
print(f"3. Hot-reload profiles: curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload")
print(f"4. Test with a sample receipt")
if __name__ == "__main__":
main()