diff --git a/.claude/rules/claude-learn-backend.md b/.claude/rules/claude-learn-backend.md
new file mode 100644
index 0000000..3cd3a9f
--- /dev/null
+++ b/.claude/rules/claude-learn-backend.md
@@ -0,0 +1,106 @@
+# Claude Learn: Backend
+
+**Domain**: backend
+**Last updated**: 2026-01-06
+**Sessions recorded**: 1
+
+Knowledge about FastAPI, Python services, Oracle DB, and backend architecture.
+
+---
+
+## Patterns
+
+### ProfileRegistry with Hot-Reload for Store Profiles
+**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
+**Description**: Registration system for OCR profiles that uses the `@ProfileRegistry.register` decorator, with hot-reload via `importlib.reload()`. Profiles can be added or modified without restarting the server.
+
+**Example** (`backend/modules/data_entry/services/ocr/profiles/__init__.py`):
+```python
+class ProfileRegistry:
+    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
+    _instances: Dict[str, "BaseStoreProfile"] = {}
+
+    @classmethod
+    def register(cls, profile_class):
+        """Decorator to register a store profile class."""
+        for cui in profile_class.CUI_LIST:
+            cls._profiles[cls._normalize_cui(cui)] = profile_class
+        return profile_class
+
+    @classmethod
+    def reload_all(cls):
+        """Hot-reload all profile modules via importlib.reload()."""
+        cls._instances.clear()
+        for module_name in cls._get_profile_module_names():
+            importlib.reload(sys.modules[f"backend...profiles.{module_name}"])
+```
+
+**Usage**:
+```python
+@ProfileRegistry.register
+class LidlProfile(BaseStoreProfile):
+    CUI_LIST = ["22891860"]
+    STORE_NAME = "LIDL DISCOUNT S.R.L."
+
+# Lookup
+profile = ProfileRegistry.get_profile("22891860")
+
+# Hot-reload (HTTP endpoint, not Python):
+#   POST /api/data-entry/ocr/profiles/reload
+```
+
+**Tags**: registry-pattern, hot-reload, decorator, ocr, singleton
+
+---
+
+### Python Code Generation Script from PDF Analysis
+**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
+**Description**: Script that analyzes PDFs through the OCR API, detects patterns (TVA format, date format, payment), and automatically generates Python code for new profiles. Includes JWT auth, async polling, and a syntax check.
+
+**Example** (`scripts/generate_store_profile.py`):
+```python
+def analyze_tva_patterns(results: List[Dict]) -> Dict:
+    """Detect the dominant TVA (VAT) format in the OCR results."""
+    tva_formats = defaultdict(int)
+    # Excerpt: raw_texts is assumed to hold the OCR text of each entry in `results`.
+    for text in raw_texts:
+        text_upper = text.upper()
+        if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
+            tva_formats["lidl_multi_rate"] += 1
+        if re.search(r'BAZA\s+TVA', text_upper):
+            tva_formats["table"] += 1
+    # default=None avoids a ValueError when no format matched.
+    return {"dominant_format": max(tva_formats, key=tva_formats.get, default=None)}
+
+def generate_profile_code(store_name, cui, tva_analysis, ...):
+    """Generate Python code for the profile class."""
+    # Template-based generation with OCR error variants
+```
+
+**Usage**:
+```bash
+# Dry run for preview
+python scripts/generate_store_profile.py \
+    --name "Magazin Nou" --cui "12345678" \
+    --receipts "docs/data-entry/MagazinNou*.pdf" --dry-run
+
+# Generate and save
+python scripts/generate_store_profile.py \
+    --name "Magazin Nou" --cui "12345678" \
+    --receipts "docs/data-entry/MagazinNou*.pdf" \
+    --output backend/.../profiles/magazin_nou.py
+```
+
+**Tags**: code-generation, ocr, automation, cli-tool
+
+---
+
+## Gotchas
+
+_(None recorded yet)_
+
+---
+
+## Statistics
+
+- **Total Patterns**: 2
+- **Total Gotchas**: 0
+- **Last Session**: 2026-01-06
+- **Sessions Recorded**: 1
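
The registry pattern recorded above relies on `importlib.reload()` re-executing a module so that its decorator runs again and overwrites the old registration. The following is a minimal, self-contained sketch of that mechanism, not the project's implementation: `ToyRegistry`, the `toy_registry` and `toy_demo_profile` module names, and `demo_hot_reload` are all hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path
from typing import Dict


class ToyRegistry:
    """Toy stand-in for ProfileRegistry: maps CUI -> profile class."""

    _profiles: Dict[str, type] = {}

    @classmethod
    def register(cls, profile_class: type) -> type:
        # Re-registration simply overwrites the previous class for each CUI.
        for cui in getattr(profile_class, "CUI_LIST", []):
            cls._profiles[cui] = profile_class
        return profile_class


# Expose the registry under an importable (hypothetical) module name so the
# temporary profile module below can import it.
_registry_module = types.ModuleType("toy_registry")
_registry_module.ToyRegistry = ToyRegistry
sys.modules["toy_registry"] = _registry_module

PROFILE_V1 = (
    "from toy_registry import ToyRegistry\n"
    "\n"
    "@ToyRegistry.register\n"
    "class DemoProfile:\n"
    "    CUI_LIST = ['12345678']\n"
    "    STORE_NAME = 'Demo v1'\n"
)
PROFILE_V2 = PROFILE_V1.replace("Demo v1", "Demo store, version 2")


def demo_hot_reload() -> tuple:
    """Import a profile module, edit it on disk, reload it, return both names."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # no .pyc, so the reload always reads the source
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy_demo_profile.py"
        path.write_text(PROFILE_V1)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("toy_demo_profile")
            before = ToyRegistry._profiles["12345678"].STORE_NAME
            path.write_text(PROFILE_V2)  # simulate editing the profile file
            importlib.reload(module)  # module body runs again, decorator re-registers
            after = ToyRegistry._profiles["12345678"].STORE_NAME
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("toy_demo_profile", None)
            sys.dont_write_bytecode = old_flag
    return before, after
```

Calling `demo_hot_reload()` returns the store name seen before and after the reload. Note that a reload only overwrites entries; a registry built this way keeps entries for modules that were deleted from disk unless it is cleared explicitly.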
diff --git a/.claude/rules/claude-learn-database.md b/.claude/rules/claude-learn-database.md
new file mode 100644
index 0000000..5997042
--- /dev/null
+++ b/.claude/rules/claude-learn-database.md
@@ -0,0 +1,28 @@
+# Claude Learn: Database
+
+**Domain**: database
+**Last updated**: -
+**Sessions recorded**: 0
+
+Knowledge about Oracle DB, SQLite, SQLModel, migrations, and data modeling.
+
+---
+
+## Patterns
+
+_(None recorded yet)_
+
+---
+
+## Gotchas
+
+_(None recorded yet)_
+
+---
+
+## Statistics
+
+- **Total Patterns**: 0
+- **Total Gotchas**: 0
+- **Last Session**: -
+- **Sessions Recorded**: 0
diff --git a/.claude/rules/claude-learn-deployment.md b/.claude/rules/claude-learn-deployment.md
new file mode 100644
index 0000000..c7b031f
--- /dev/null
+++ b/.claude/rules/claude-learn-deployment.md
@@ -0,0 +1,55 @@
+# Claude Learn: Deployment
+
+**Domain**: deployment
+**Last updated**: 2026-01-06
+**Sessions recorded**: 1
+
+Knowledge about IIS, Docker, deployment scripts, and infrastructure.
+
+---
+
+## Patterns
+
+### IIS URL Rewrite Rules for SPA with Multiple API Backends
+**Discovered**: 2025-12-22 (feature: unified-app)
+**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
+
+**Example** (`public/web.config:5-28`):
+```xml
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+```
+
+**Tags**: iis, deployment, spa, microservices, proxy
+
+---
+
+## Gotchas
+
+_(None recorded yet)_
+
+---
+
+## Statistics
+
+- **Total Patterns**: 1
+- **Total Gotchas**: 0
+- **Last Session**: 2026-01-06
+- **Sessions Recorded**: 1
diff --git a/.claude/rules/claude-learn-domains.md b/.claude/rules/claude-learn-domains.md
new file mode 100644
index 0000000..751e43f
--- /dev/null
+++ b/.claude/rules/claude-learn-domains.md
@@ -0,0 +1,83 @@
+# Claude Learn Domains Configuration
+
+**Last updated**: 2026-01-06
+
+This file defines available knowledge domains and their file path patterns.
+
+---
+
+## Domains
+
+### frontend
+**File**: `claude-learn-frontend.md`
+**Patterns**:
+- `src/**/*.vue`
+- `src/**/*.js`
+- `src/**/*.ts`
+- `src/**/*.css`
+- `vite.config.*`
+- `package.json`
+
+---
+
+### backend
+**File**: `claude-learn-backend.md`
+**Patterns**:
+- `backend/**/*.py`
+- `backend/modules/**/*`
+- `requirements.txt`
+
+---
+
+### database
+**File**: `claude-learn-database.md`
+**Patterns**:
+- `**/*.sql`
+- `**/models.py`
+- `**/schemas.py`
+- `backend/**/db/**/*`
+
+---
+
+### testing
+**File**: `claude-learn-testing.md`
+**Patterns**:
+- `tests/**/*`
+- `**/*.test.*`
+- `**/*.spec.*`
+- `pytest.ini`
+- `vitest.config.*`
+
+---
+
+### deployment
+**File**: `claude-learn-deployment.md`
+**Patterns**:
+- `deployment/**/*`
+- `public/web.config`
+- `Dockerfile*`
+- `docker-compose*.yml`
+- `*.sh`
+- `ansible/**/*`
+
+---
+
+### global
+**File**: `claude-learn-global.md`
+**Patterns**:
+- `*` (catch-all for cross-cutting concerns)
+
+---
+
+## Statistics
+
+| Domain | Patterns | Gotchas | Last Updated |
+|--------|----------|---------|--------------|
+| frontend | 8 | 10 | 2026-01-06 |
+| deployment | 1 | 0 | 2026-01-06 |
+| global | 0 | 1 | 2026-01-06 |
+| backend | 2 | 0 | 2026-01-06 |
+| database | 0 | 0 | - |
+| testing | 0 | 0 | - |
+
+**Total**: 11 patterns, 11 gotchas across 4 domains
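
The domain configuration above maps file paths to knowledge domains through glob patterns. A minimal sketch of how such a lookup could work is shown below. It assumes that `**/` matches zero or more directories, that `*` does not cross a `/`, that a path may belong to several domains, and that `global` is used only when nothing else matches; none of these rules is stated in the configuration itself. The names `glob_to_regex` and `domains_for` are hypothetical.

```python
import re
from typing import Dict, List

# Patterns copied from the domain list above ("global" is handled as the fallback).
DOMAIN_PATTERNS: Dict[str, List[str]] = {
    "frontend": ["src/**/*.vue", "src/**/*.js", "src/**/*.ts", "src/**/*.css",
                 "vite.config.*", "package.json"],
    "backend": ["backend/**/*.py", "backend/modules/**/*", "requirements.txt"],
    "database": ["**/*.sql", "**/models.py", "**/schemas.py", "backend/**/db/**/*"],
    "testing": ["tests/**/*", "**/*.test.*", "**/*.spec.*", "pytest.ini",
                "vitest.config.*"],
    "deployment": ["deployment/**/*", "public/web.config", "Dockerfile*",
                   "docker-compose*.yml", "*.sh", "ansible/**/*"],
}


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex: '**/' spans directories, '*' stays in one."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")  # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def domains_for(path: str) -> List[str]:
    """Return every domain whose patterns match, or ['global'] if none does."""
    matches = [
        domain
        for domain, patterns in DOMAIN_PATTERNS.items()
        if any(glob_to_regex(p).fullmatch(path) for p in patterns)
    ]
    return matches or ["global"]
```

With these assumptions a file such as `backend/modules/data_entry/models.py` falls into both `backend` and `database`, so a real implementation would need a rule for ordering or prioritizing overlapping domains.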
diff --git a/.claude/rules/auto-build-memory.md b/.claude/rules/claude-learn-frontend.md
similarity index 84%
rename from .claude/rules/auto-build-memory.md
rename to .claude/rules/claude-learn-frontend.md
index d645842..bc43694 100644
--- a/.claude/rules/auto-build-memory.md
+++ b/.claude/rules/claude-learn-frontend.md
@@ -1,9 +1,10 @@
-# Learned Patterns & Gotchas
+# Claude Learn: Frontend
-**Last updated**: 2025-12-24
-**Maintained**: Manually (add new patterns/gotchas as discovered)
+**Domain**: frontend
+**Last updated**: 2026-01-06
+**Sessions recorded**: 3
-This file contains insights learned during feature implementations. Claude Code auto-loads this file to prevent repeating past mistakes.
+Knowledge about Vue.js, Vite, Pinia, CSS, and frontend architecture.
---
@@ -130,37 +131,6 @@ resolve: {
---
-### IIS URL Rewrite Rules for SPA with Multiple API Backends
-**Discovered**: 2025-12-22 (feature: unified-app)
-**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
-
-**Example** (`public/web.config:5-28`):
-```xml
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-```
-
-**Tags**: iis, deployment, spa, microservices, proxy
-
----
-
### Vue Watcher for Auto-Loading Dependent Data
**Discovered**: 2025-12-24 (feature: unified-app-ux)
**Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention.
@@ -248,15 +218,6 @@ const getStorageKey = () => {
---
-### Sed Command Quote Mismatch in Bulk Find-Replace
-**Discovered**: 2025-12-22 (feature: unified-app)
-**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
-**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
-
-**Tags**: sed, regex, scripting, find-replace, migration
-
----
-
### Circular Reference in API Wrapper
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name.
@@ -287,7 +248,7 @@ const getStorageKey = () => {
### Vite Build Transform Count is Progress Indicator
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration.
-**Solution**: Watch the 'transforming... ✓ N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200→573→1490→1492 modules meant we were getting close to success. Use this as encouragement!
+**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200->573->1490->1492 modules meant we were getting close to success. Use this as encouragement!
**Tags**: vite, build, debugging, progress-tracking, developer-experience
@@ -329,9 +290,9 @@ const getStorageKey = () => {
---
-## Memory Statistics
+## Statistics
-- **Total Patterns**: 9
-- **Total Gotchas**: 11
-- **Last Session**: 2025-12-24 (unified-app-ux)
-- **Sessions Recorded**: 2
+- **Total Patterns**: 8
+- **Total Gotchas**: 10
+- **Last Session**: 2026-01-06
+- **Sessions Recorded**: 3
diff --git a/.claude/rules/claude-learn-global.md b/.claude/rules/claude-learn-global.md
new file mode 100644
index 0000000..3a27a79
--- /dev/null
+++ b/.claude/rules/claude-learn-global.md
@@ -0,0 +1,33 @@
+# Claude Learn: Global
+
+**Domain**: global
+**Last updated**: 2026-01-06
+**Sessions recorded**: 1
+
+Cross-cutting knowledge applicable to multiple domains (scripting, tooling, workflow).
+
+---
+
+## Patterns
+
+_(None recorded yet)_
+
+---
+
+## Gotchas
+
+### Sed Command Quote Mismatch in Bulk Find-Replace
+**Discovered**: 2025-12-22 (feature: unified-app)
+**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
+**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
+
+**Tags**: sed, regex, scripting, find-replace, migration
+
+---
+
+## Statistics
+
+- **Total Patterns**: 0
+- **Total Gotchas**: 1
+- **Last Session**: 2026-01-06
+- **Sessions Recorded**: 1
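
The sed gotcha above comes from a pattern that hard-codes one quote style. One way to sidestep it is to capture whichever quote is present and write it back unchanged. The sketch below does this in Python rather than sed; `rewrite_store_imports` and the `@shared/stores/` replacement prefix are hypothetical and only illustrate the idea.

```python
import re

# Capture the quote character so the replacement preserves it.
IMPORT_RE = re.compile(r"""(from\s+)(['"])@/stores/""")


def rewrite_store_imports(source: str, new_prefix: str) -> str:
    """Replace the '@/stores/' import prefix regardless of quote style."""
    # A function replacement avoids backslash-escape surprises in new_prefix.
    return IMPORT_RE.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{new_prefix}",
        source,
    )


if __name__ == "__main__":
    sample = (
        'import { useAuth } from "@/stores/auth"\n'
        "import { useCart } from '@/stores/cart'\n"
    )
    print(rewrite_store_imports(sample, "@shared/stores/"))
```

Both import lines are rewritten in a single pass, and each keeps its original quote character.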
diff --git a/.claude/rules/claude-learn-testing.md b/.claude/rules/claude-learn-testing.md
new file mode 100644
index 0000000..a1527bf
--- /dev/null
+++ b/.claude/rules/claude-learn-testing.md
@@ -0,0 +1,28 @@
+# Claude Learn: Testing
+
+**Domain**: testing
+**Last updated**: -
+**Sessions recorded**: 0
+
+Knowledge about pytest, Vitest, test patterns, and validation strategies.
+
+---
+
+## Patterns
+
+_(None recorded yet)_
+
+---
+
+## Gotchas
+
+_(None recorded yet)_
+
+---
+
+## Statistics
+
+- **Total Patterns**: 0
+- **Total Gotchas**: 0
+- **Last Session**: -
+- **Sessions Recorded**: 0
diff --git a/backend/modules/data_entry/routers/ocr.py b/backend/modules/data_entry/routers/ocr.py
index c1846f1..4965182 100644
--- a/backend/modules/data_entry/routers/ocr.py
+++ b/backend/modules/data_entry/routers/ocr.py
@@ -628,3 +628,86 @@ def _dict_to_extraction_data(data: dict) -> ExtractionData:
validation_errors=data.get('validation_errors', []),
inter_ocr_ratios=data.get('inter_ocr_ratios', {}),
)
+
+
+# ============================================================================
+# Store Profiles Management Endpoints
+# ============================================================================
+
+@router.post("/profiles/reload")
+async def reload_store_profiles(
+    current_user: CurrentUser = Depends(get_current_user)
+) -> dict:
+    """
+    Hot-reload all store profiles.
+
+    Reloads profile Python modules without server restart.
+    Use after adding/modifying profile files.
+
+    Returns:
+        Dict with reloaded count and profile list
+    """
+    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
+
+    count = ProfileRegistry.reload_all()
+    status = ProfileRegistry.get_reload_status()
+
+    return {
+        "success": True,
+        "reloaded_modules": count,
+        "profiles_count": status["profiles_count"],
+        "registered_cuis": status["registered_cuis"],
+        "last_reload": status["last_reload"],
+    }
+
+
+@router.get("/profiles")
+async def list_store_profiles(
+    current_user: CurrentUser = Depends(get_current_user)
+) -> dict:
+    """
+    List all registered store profiles.
+
+    Returns:
+        Dict with profiles list and status
+    """
+    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
+
+    profiles = ProfileRegistry.list_profiles()
+    status = ProfileRegistry.get_reload_status()
+
+    return {
+        "profiles": profiles,
+        "count": len(profiles),
+        "last_reload": status["last_reload"],
+    }
+
+
+@router.get("/profiles/{cui}")
+async def get_store_profile(
+    cui: str,
+    current_user: CurrentUser = Depends(get_current_user)
+) -> dict:
+    """
+    Get details for a specific store profile.
+
+    Args:
+        cui: Store CUI (with or without RO prefix)
+
+    Returns:
+        Profile details including validation hints
+
+    Raises:
+        404: If no profile exists for this CUI
+    """
+    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
+
+    info = ProfileRegistry.get_profile_info(cui)
+
+    if not info:
+        raise HTTPException(
+            status_code=404,
+            detail=f"No profile registered for CUI: {cui}"
+        )
+
+    return info
diff --git a/backend/modules/data_entry/services/ocr/profiles/README.md b/backend/modules/data_entry/services/ocr/profiles/README.md
new file mode 100644
index 0000000..65c7695
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/README.md
@@ -0,0 +1,258 @@
+# Store Profiles - OCR Extraction
+
+Store-specific profile system for OCR extraction, with hot-reload.
+
+---
+
+## Quick Start: Add a new profile
+
+```bash
+# 1. Generate a profile from PDFs (dry run for preview)
+python scripts/generate_store_profile.py \
+    --name "Magazin Nou SRL" \
+    --cui "12345678" \
+    --receipts "docs/data-entry/MagazinNou*.pdf" \
+    --dry-run
+
+# 2. Generate and save
+python scripts/generate_store_profile.py \
+    --name "Magazin Nou SRL" \
+    --cui "12345678" \
+    --receipts "docs/data-entry/MagazinNou*.pdf" \
+    --output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py
+
+# 3. Hot-reload (no server restart); the endpoints require authentication
+curl -X POST -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/data-entry/ocr/profiles/reload
+
+# 4. Verify
+curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/data-entry/ocr/profiles
+```
+
+---
+
+## Directory structure
+
+```
+profiles/
+├── __init__.py          # ProfileRegistry + hot-reload (~390 lines)
+├── base.py              # BaseStoreProfile + generic patterns (~410 lines)
+├── lidl.py              # Multi-rate TVA (A/B)
+├── omv.py               # B2B, date YYYY.MM.DD
+├── socar.py             # B2B, date YYYY.MM.DD
+├── brick.py             # Standard TVA
+├── dedeman.py           # E-factura support
+├── kineterra.py         # Non-VAT payer
+├── gama_ink.py          # Standard TVA (toner/cartridges)
+├── electrobering.py     # Standard TVA (electronics)
+├── pictus_velum.py      # Standard TVA (stationery)
+├── unlimited_keys.py    # Standard TVA, NUMERAR (cash) payment
+├── best_print.py        # Non-VAT payer (neplătitor TVA)
+├── stepout_market.py    # TVA 5% (books/bookstore)
+└── README.md            # This file
+```
+
+---
+
+## Existing profiles (12)
+
+> **Note**: The TVA patterns are **flexible** and accept ANY rate (5%, 9%, 11%, 19%, 21%, etc.)
+> so they handle both historical data and future changes in legislation.
+
+| Store | CUI | File | Characteristics |
+|-------|-----|------|-----------------|
+| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (codes A, B, C, D) |
+| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), date YYYY.MM.DD |
+| SOCAR PETROLEUM S.A. | 12546600 | `socar.py` | B2B (client CUI), date YYYY.MM.DD |
+| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA |
+| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support |
+| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returns `[]`) |
+| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartridges) |
+| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronics) |
+| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (stationery) |
+| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** (cash) payment |
+| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** (neplătitor TVA) |
+| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (books, bookstore) |
+
+---
+
+## API Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/api/data-entry/ocr/profiles` | GET | List all profiles |
+| `/api/data-entry/ocr/profiles/{cui}` | GET | Profile details (accepts the RO prefix) |
+| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload all profiles |
+
+### API examples
+
+```bash
+# List profiles
+curl http://localhost:8000/api/data-entry/ocr/profiles \
+    -H "Authorization: Bearer $TOKEN"
+
+# Profile details (with or without the RO prefix)
+curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/data-entry/ocr/profiles/22891860
+curl -H "Authorization: Bearer $TOKEN" http://localhost:8000/api/data-entry/ocr/profiles/RO22891860
+
+# Hot-reload after changes
+curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
+    -H "Authorization: Bearer $TOKEN"
+
+# Reload response:
+{
+ "success": true,
+ "reloaded_modules": 12,
+ "profiles_count": 12,
+ "registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...],
+ "last_reload": "2026-01-06T22:37:05.000000"
+}
+```
+
+---
+
+## How the system works
+
+### Extraction flow
+
+```
+ReceiptExtractor.extract()
+    │
+    ├─► STEP 1: Extract vendor + CUI
+    │     └─► _extract_vendor(), _extract_cui()
+    │
+    ├─► ProfileRegistry.get_profile(cui)
+    │     └─► Returns the store-specific profile or None
+    │
+    ├─► STEP 2: Extraction with the profile (if one exists)
+    │     ├─► profile.extract_total()
+    │     ├─► profile.extract_date()
+    │     ├─► profile.extract_receipt_number()
+    │     ├─► profile.extract_tva_entries()
+    │     ├─► profile.extract_payment_methods()
+    │     └─► profile.extract_client_cui()
+    │
+    └─► STEP 3-4: Validation + post-processing
+```
+
+### Fallback
+
+If no profile exists for the CUI, the generic logic in `ReceiptExtractor` is used.
+
+---
+
+## Structure of a profile
+
+```python
+from typing import Any, Dict, List
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+@ProfileRegistry.register
+class MagazinNouProfile(BaseStoreProfile):
+    """Docstring describing the store."""
+
+    CUI_LIST = ["12345678"]  # May list several CUIs
+    NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"]  # OCR variants
+    STORE_NAME = "Magazin Nou SRL"
+
+    # Override only what differs from the base class
+    def extract_tva_entries(self, text: str) -> List[dict]:
+        # Store-specific patterns
+        ...
+
+    def get_validation_hints(self) -> Dict[str, Any]:
+        return {
+            "has_multi_rate_tva": False,
+            "card_equals_total": True,
+            "has_client_cui": False,
+            "has_efactura": False,
+            "is_non_vat_payer": False,
+        }
+```
+
+---
+
+## Patterns available in base.py
+
+BaseStoreProfile includes generic, OCR-tolerant patterns:
+
+| Pattern | Description |
+|---------|-------------|
+| `TOTAL_PATTERNS` | 8 variants for TOTAL (TOTAL:, TOTAL DE PLATA, etc.) |
+| `DATE_PATTERNS` | 6 variants (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) |
+| `DATE_PATTERNS_OCR_SPACES` | 4 variants with OCR-inserted spaces ("2025. 08. 14") |
+| `NUMBER_PATTERNS` | 11 variants for the receipt number (NDS, BF, C3POS) |
+| `PAYMENT_PATTERNS` | 8 variants for CARD/NUMERAR |
+| `CLIENT_MARKERS` | 6 variants for the CLIENT section |
+| `CLIENT_CUI_PATTERNS` | 7 variants for the client CUI |
+
+### Methods implemented in the base class
+
+- `extract_total(text)` → `Tuple[Decimal, float]`
+- `extract_date(text)` → `Tuple[date, float]`
+- `extract_receipt_number(text)` → `Tuple[str, float]`
+- `extract_payment_methods(text)` → `List[dict]`
+- `extract_client_cui(text)` → `Tuple[str, float]`
+- `extract_client_name(text)` → `Tuple[str, float]`
+
+---
+
+## When do you need a custom profile?
+
+| Situation | Example | What to override |
+|-----------|---------|------------------|
+| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` |
+| **Special date format** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` |
+| **B2B receipts** | Gas stations (they include a client CUI) | `extract_client_cui()` |
+| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returns `[]` |
+| **E-factura** | Dedeman | `extract_efactura_reference()` |
+
+---
+
+## Design decisions
+
+1. **Manual hot-reload** - the `/profiles/reload` endpoint is called whenever profile files change
+2. **Persistence in Python** - profiles live in Git and are version controlled
+3. **Graceful fallback** - if no profile exists, the generic logic is used
+4. **CUI normalization** - the "RO" prefix and whitespace are handled automatically
+5. **TVA deduplication** - uses `seen = set()` to avoid duplicates
+
+---
+
+## Useful commands
+
+```bash
+# Check Python syntax for all profiles
+for f in backend/modules/data_entry/services/ocr/profiles/*.py; do
+    python3 -m py_compile "$f" && echo "✓ $(basename "$f")"
+done
+
+# List profiles
+ls -la backend/modules/data_entry/services/ocr/profiles/
+
+# Start the backend for testing
+cd backend && source venv/bin/activate
+uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
+
+# Test OCR on a PDF
+curl -X POST -F "file=@docs/data-entry/test.pdf" \
+    -H "Authorization: Bearer $TOKEN" \
+    "http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus"
+```
+
+---
+
+## Profile generator script
+
+`scripts/generate_store_profile.py` - automatic profile generator
+
+```bash
+# Show help
+python scripts/generate_store_profile.py --help
+
+# Features:
+# - Analyzes PDFs through the OCR API
+# - Detects: TVA format, date format, payment patterns, B2B
+# - Generates Python code with OCR error variants
+# - Supports glob patterns (*.pdf)
+# - Checks syntax after generation
+```
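
Design decision 5 in the README above mentions deduplicating TVA entries with a `seen` set, and the note on flexible patterns says any rate is accepted. A minimal sketch of how the two can combine is shown below. The regex, the `(rate, amount)` key, and the returned dict keys are assumptions made for illustration and do not reproduce the patterns in `base.py` or in any real profile.

```python
import re
from decimal import Decimal
from typing import List

# Hypothetical line format: "TVA A 19% 3,80" or "TVA 9% 0.45".
# The rate is captured as-is, so any percentage is accepted.
TVA_LINE = re.compile(
    r"TVA\s+(?:[A-D]\s+)?(\d{1,2})(?:[.,]\d+)?\s*%\s+(\d+[.,]\d{2})"
)


def extract_tva_entries(text: str) -> List[dict]:
    """Collect (rate, amount) pairs, skipping pairs already seen."""
    entries: List[dict] = []
    seen = set()
    for match in TVA_LINE.finditer(text.upper()):
        rate = int(match.group(1))
        amount = Decimal(match.group(2).replace(",", "."))
        key = (rate, amount)
        if key in seen:
            continue  # OCR output often repeats the same TVA line
        seen.add(key)
        entries.append({"rate": rate, "amount": amount})
    return entries
```

Keying on the `(rate, amount)` pair drops exact repeats while keeping two different amounts at the same rate. Whether that is the right key depends on the receipt layout, so a real profile may need a different one.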
diff --git a/backend/modules/data_entry/services/ocr/profiles/__init__.py b/backend/modules/data_entry/services/ocr/profiles/__init__.py
new file mode 100644
index 0000000..cb0be54
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/__init__.py
@@ -0,0 +1,388 @@
+"""
+Store Profiles Registry with Hot-Reload Support.
+
+This module provides a registry for store-specific OCR extraction profiles.
+Profiles can be reloaded at runtime without restarting the server.
+
+Usage:
+    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
+
+    # Get profile for a CUI
+    profile = ProfileRegistry.get_profile("22891860")
+    if profile:
+        tva_entries = profile.extract_tva_entries(text)
+
+    # Reload all profiles (after file changes)
+    count = ProfileRegistry.reload_all()
+
+Architecture:
+    - ProfileRegistry: Singleton registry with class methods
+    - BaseStoreProfile: Abstract base class for profiles
+    - @ProfileRegistry.register: Decorator for profile classes
+
+Hot-Reload Mechanism:
+    1. Admin calls POST /profiles/reload endpoint
+    2. Registry clears instance cache
+    3. importlib.reload() re-executes each profile module
+    4. @register decorator re-registers classes with new code
+"""
+
+from __future__ import annotations
+
+import importlib
+import logging
+import sys
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, List, Optional, Type, TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from .base import BaseStoreProfile
+
+logger = logging.getLogger(__name__)
+
+# Directory containing profile modules
+PROFILES_DIR = Path(__file__).parent
+
+
+class ProfileRegistry:
+    """
+    Registry for store-specific OCR extraction profiles.
+
+    Uses class methods for singleton-like behavior without explicit instantiation.
+    Supports hot-reload via importlib.reload() for runtime updates.
+
+    Attributes:
+        _profiles: Maps CUI -> profile class (not instance)
+        _instances: Maps CUI -> profile instance (lazy, cleared on reload)
+        _last_reload: Timestamp of last reload
+        _loaded: Whether initial load has been performed
+    """
+
+    # Class-level storage (singleton pattern via class methods)
+    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
+    _instances: Dict[str, "BaseStoreProfile"] = {}
+    _last_reload: Optional[datetime] = None
+    _loaded: bool = False
+
+    # -------------------------------------------------------------------------
+    # Registration
+    # -------------------------------------------------------------------------
+
+    @classmethod
+    def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]:
+        """
+        Decorator to register a store profile class.
+
+        Registers the profile for all CUIs in the class's CUI_LIST.
+        Safe for re-registration during hot-reload (overwrites existing).
+
+        Usage:
+            @ProfileRegistry.register
+            class LidlProfile(BaseStoreProfile):
+                CUI_LIST = ["22891860"]
+                ...
+
+        Args:
+            profile_class: Profile class to register
+
+        Returns:
+            The same class (allows use as decorator)
+
+        Note:
+            A class with an empty CUI_LIST is skipped with a warning;
+            no exception is raised.
+        """
+        cui_list = getattr(profile_class, 'CUI_LIST', [])
+        store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__)
+
+        if not cui_list:
+            logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping")
+            return profile_class
+
+        # Register for each CUI
+        for cui in cui_list:
+            # Normalize CUI (remove RO prefix, strip whitespace)
+            normalized_cui = cls._normalize_cui(cui)
+
+            if normalized_cui in cls._profiles:
+                old_class = cls._profiles[normalized_cui]
+                logger.debug(
+                    f"Re-registering CUI {normalized_cui}: "
+                    f"{old_class.__name__} -> {profile_class.__name__}"
+                )
+                # Clear cached instance for this CUI
+                cls._instances.pop(normalized_cui, None)
+
+            cls._profiles[normalized_cui] = profile_class
+            logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}")
+
+        logger.info(f"Registered {store_name} for CUIs: {cui_list}")
+        return profile_class
+
+    # -------------------------------------------------------------------------
+    # Lookup
+    # -------------------------------------------------------------------------
+
+    @classmethod
+    def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]:
+        """
+        Get profile instance for a CUI.
+
+        Uses lazy instantiation - creates instance on first access.
+        Returns None if no profile is registered for this CUI.
+
+        Args:
+            cui: CUI to lookup (with or without RO prefix)
+
+        Returns:
+            Profile instance or None
+        """
+        if not cui:
+            return None
+
+        # Ensure profiles are loaded
+        if not cls._loaded:
+            cls._load_all_profiles()
+
+        normalized_cui = cls._normalize_cui(cui)
+
+        # Check if profile exists
+        profile_class = cls._profiles.get(normalized_cui)
+        if not profile_class:
+            return None
+
+        # Lazy instantiation
+        if normalized_cui not in cls._instances:
+            try:
+                cls._instances[normalized_cui] = profile_class()
+                logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}")
+            except Exception as e:
+                logger.error(f"Failed to instantiate {profile_class.__name__}: {e}")
+                return None
+
+        return cls._instances[normalized_cui]
+
+    @classmethod
+    def has_profile(cls, cui: Optional[str]) -> bool:
+        """Check if a profile exists for this CUI."""
+        if not cui:
+            return False
+        if not cls._loaded:
+            cls._load_all_profiles()
+        return cls._normalize_cui(cui) in cls._profiles
+
+    # -------------------------------------------------------------------------
+    # Listing
+    # -------------------------------------------------------------------------
+
+    @classmethod
+    def list_profiles(cls) -> List[Dict]:
+        """
+        List all registered profiles.
+
+        Returns:
+            List of dicts with cuis, class_name, store_name, name_patterns
+        """
+        if not cls._loaded:
+            cls._load_all_profiles()
+
+        result = []
+        seen_classes = set()
+
+        for cui, profile_class in cls._profiles.items():
+            # Avoid duplicates for profiles with multiple CUIs
+            if profile_class.__name__ in seen_classes:
+                continue
+            seen_classes.add(profile_class.__name__)
+
+            result.append({
+                "cuis": list(getattr(profile_class, 'CUI_LIST', [])),
+                "class_name": profile_class.__name__,
+                "store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__),
+                "name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])),
+            })
+
+        return result
+
+    @classmethod
+    def get_profile_info(cls, cui: str) -> Optional[Dict]:
+        """
+        Get detailed info about a profile.
+
+        Args:
+            cui: CUI to lookup
+
+        Returns:
+            Dict with profile details or None
+        """
+        profile = cls.get_profile(cui)
+        if not profile:
+            return None
+
+        return {
+            "cui": cui,
+            "cuis": list(profile.CUI_LIST),
+            "class_name": profile.__class__.__name__,
+            "store_name": profile.STORE_NAME,
+            "name_patterns": list(profile.NAME_PATTERNS),
+            "validation_hints": profile.get_validation_hints(),
+        }
+
+    # -------------------------------------------------------------------------
+    # Hot-Reload
+    # -------------------------------------------------------------------------
+
+    @classmethod
+    def reload_all(cls) -> int:
+        """
+        Hot-reload all profile modules.
+
+        Clears instance cache and reloads all .py files in profiles directory.
+        Decorator re-registers classes with updated code.
+
+        Returns:
+            Number of modules reloaded
+        """
+        logger.info("Starting profile hot-reload...")
+
+        # Clear instance cache (will be recreated on next get_profile)
+        cls._instances.clear()
+
+        # Get list of profile modules (exclude __init__, base)
+        module_names = cls._get_profile_module_names()
+
+        count = 0
+        for module_name in module_names:
+            full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
+
+            try:
+                if full_name in sys.modules:
+                    # Reload existing module
+                    importlib.reload(sys.modules[full_name])
+                    logger.debug(f"Reloaded module: {module_name}")
+                else:
+                    # Import new module
+                    importlib.import_module(full_name)
+                    logger.debug(f"Imported new module: {module_name}")
+                count += 1
+            except Exception as e:
+                logger.error(f"Failed to reload {module_name}: {e}")
+
+        cls._last_reload = datetime.utcnow()
+        cls._loaded = True
+
+        logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles")
+        return count
+
+    @classmethod
+    def get_reload_status(cls) -> Dict:
+        """Get status of the registry including last reload time."""
+        return {
+            "loaded": cls._loaded,
+            "last_reload": cls._last_reload.isoformat() if cls._last_reload else None,
+            "profiles_count": len(cls._profiles),
+            "instances_count": len(cls._instances),
+            "registered_cuis": list(cls._profiles.keys()),
+        }
+
+ # -------------------------------------------------------------------------
+ # Internal methods
+ # -------------------------------------------------------------------------
+
+ @classmethod
+ def _normalize_cui(cls, cui: str) -> str:
+ """
+ Normalize CUI for consistent lookup.
+
+ - Removes RO prefix (with or without space)
+ - Strips whitespace
+ - Converts to uppercase
+
+ Args:
+ cui: Raw CUI string
+
+ Returns:
+ Normalized CUI (uppercase, RO prefix and surrounding whitespace removed)
+ """
+ if not cui:
+ return ""
+
+ cui = str(cui).strip().upper()
+
+ # Remove RO prefix (handles "RO12345" and "RO 12345")
+ if cui.startswith("RO"):
+ cui = cui[2:].lstrip()
+
+ return cui.strip()
+
+ @classmethod
+ def _get_profile_module_names(cls) -> List[str]:
+ """
+ Get list of profile module names from profiles directory.
+
+ Excludes __init__.py and base.py.
+
+ Returns:
+ List of module names (without .py extension)
+ """
+ excluded = {"__init__", "base", "__pycache__"}
+ modules = []
+
+ for path in PROFILES_DIR.glob("*.py"):
+ name = path.stem
+ if name not in excluded:
+ modules.append(name)
+
+ return sorted(modules)
+
+ @classmethod
+ def _load_all_profiles(cls) -> None:
+ """
+ Initial load of all profile modules.
+
+ Called automatically on first get_profile() if not already loaded.
+ """
+ if cls._loaded:
+ return
+
+ logger.info("Loading store profiles...")
+
+ module_names = cls._get_profile_module_names()
+
+ for module_name in module_names:
+ full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
+ try:
+ importlib.import_module(full_name)
+ logger.debug(f"Loaded module: {module_name}")
+ except Exception as e:
+ logger.error(f"Failed to load {module_name}: {e}")
+
+ cls._loaded = True
+ cls._last_reload = datetime.utcnow()
+
+ logger.info(f"Loaded {len(cls._profiles)} store profiles")
+
+ @classmethod
+ def clear(cls) -> None:
+ """
+ Clear all registered profiles.
+
+ Mainly useful for testing.
+ """
+ cls._profiles.clear()
+ cls._instances.clear()
+ cls._loaded = False
+ cls._last_reload = None
+
+
+# -------------------------------------------------------------------------
+# Module exports
+# -------------------------------------------------------------------------
+
+__all__ = [
+ "ProfileRegistry",
+ "BaseStoreProfile",
+]
+
+# Re-export BaseStoreProfile for convenience
+from .base import BaseStoreProfile
diff --git a/backend/modules/data_entry/services/ocr/profiles/base.py b/backend/modules/data_entry/services/ocr/profiles/base.py
new file mode 100644
index 0000000..1dd718e
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/base.py
@@ -0,0 +1,515 @@
+"""
+Base class for store-specific OCR extraction profiles.
+
+Each store can have different receipt formats (TVA layout, total position, etc.).
+Store profiles allow customizing extraction logic per-store for better accuracy.
+
+Usage:
+ from .base import BaseStoreProfile
+ from . import ProfileRegistry
+
+ @ProfileRegistry.register
+ class LidlProfile(BaseStoreProfile):
+ CUI_LIST = ["22891860"]
+ NAME_PATTERNS = ["LIDL", "LDL"]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ # Custom Lidl TVA extraction logic
+ ...
+"""
+
+import re
+from abc import ABC
+from decimal import Decimal, InvalidOperation
+from typing import List, Optional, Tuple, Dict, Any
+from datetime import date
+
+
+class BaseStoreProfile(ABC):
+ """
+ Abstract base class for store-specific extraction profiles.
+
+ Each profile defines:
+ - CUI_LIST: CUI codes that identify this store (without RO prefix)
+ - NAME_PATTERNS: OCR-tolerant name patterns for fallback matching
+ - Custom extraction methods for TVA, total, date, etc.
+
+ The ProfileRegistry uses CUI_LIST to lookup profiles during extraction.
+ """
+
+ # -------------------------------------------------------------------------
+ # Class attributes - override in subclasses
+ # -------------------------------------------------------------------------
+
+ # List of CUI codes (without RO prefix) that identify this store
+ CUI_LIST: List[str] = []
+
+ # OCR-tolerant name patterns for fallback matching
+ NAME_PATTERNS: List[str] = []
+
+ # Store display name
+ STORE_NAME: str = "Unknown Store"
+
+ # -------------------------------------------------------------------------
+ # Generic patterns - can be overridden in subclasses
+ # -------------------------------------------------------------------------
+
+ # Total amount patterns (confidence-weighted); the lookbehind keeps SUBTOTAL from matching as TOTAL
+ TOTAL_PATTERNS = [
+ (r'(?<![A-Z])T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98),
+ (r'(?<![A-Z])TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
+ (r'(?<![A-Z])[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95),
+ (r'(?<![A-Z])TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
+ (r'(?<![A-Z])TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
+ (r'SUBTOTAL\s*([\d\s.,]+)', 0.90),
+ (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
+ (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
+ ]
+
+ # Date patterns (confidence-weighted)
+ DATE_PATTERNS = [
+ (r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
+ (r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
+ (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95),
+ (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90),
+ (r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80),
+ (r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75),
+ ]
+
+ # Date patterns with OCR-introduced spaces (separate because each entry also carries the field order: 'ymd' or 'dmy')
+ DATE_PATTERNS_OCR_SPACES = [
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'),
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
+ ]
+
+ # Receipt number patterns (confidence-weighted)
+ NUMBER_PATTERNS = [
+ (r'NDS\s*:?\s*(\d+)', 0.98),
+ (r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98),
+ (r'C3POS.*?(\d{6,7})\b', 0.95),
+ (r'BF\s*:\s*(\d{4,})', 0.96),
+ (r'BF\s+(\d{4,})', 0.93),
+ (r'NIVS\s*:?\s*(\d+)', 0.95),
+ (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
+ (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
+ (r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95),
+ (r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90),
+ (r'ID\s*BF\s*:?\s*(\d+)', 0.90),
+ ]
+
+ # Payment method patterns (pattern, method_type, confidence)
+ PAYMENT_PATTERNS = [
+ (r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98),
+ (r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97),
+ (r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95),
+ (r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95),
+ (r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90),
+ (r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70),
+ (r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75),
+ (r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70),
+ ]
+
+ # Client section markers (for B2B receipts)
+ CLIENT_MARKERS = [
+ r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:',
+ r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:',
+ r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:',
+ r'CLIENT\s*:',
+ r'CUMPARATOR\s*:',
+ r'BENEFICIAR\s*:',
+ ]
+
+ # Client CUI patterns (pattern, confidence)
+ CLIENT_CUI_PATTERNS = [
+ (r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99),
+ (r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98),
+ (r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98),
+ (r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
+ (r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
+ (r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95),
+ (r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90),
+ ]
+
+ # Company type indicators (for identifying company names)
+ COMPANY_INDICATORS = [
+ r'\bS\.?\s*R\.?\s*L\.?\b', # S.R.L. or S. R. L.
+ r'\bS\.?\s*A\.?\b', # S.A. or S. A.
+ r'\bS\.?\s*N\.?\s*C\.?\b', # S.N.C. or S. N. C.
+ r'\bS\.?\s*C\.?\s*S\.?\b', # S.C.S. or S. C. S.
+ r'\bI\.?\s*I\.?\b', # I.I. or I. I.
+ r'\bP\.?\s*F\.?\s*A\.?\b', # P.F.A. or P. F. A.
+ r'\bS\.?\s*C\.?\s+[A-Z]', # S.C. followed by company name
+ r'HOLDING',
+ r'COMPANY',
+ r'GROUP',
+ ]
+
+ # Maximum reasonable payment amount (to filter OCR errors)
+ MAX_PAYMENT = Decimal('100000')
+
+ # -------------------------------------------------------------------------
+ # Extraction methods - override in subclasses as needed
+ # -------------------------------------------------------------------------
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Override this method in subclasses to handle store-specific TVA formats.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of dicts with keys: code, percent, amount
+ """
+ return []
+
+ def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]:
+ """
+ Extract total amount from receipt text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ Tuple of (amount, confidence) or (None, 0.0)
+ """
+ text_upper = text.upper()
+
+ for pattern, confidence in self.TOTAL_PATTERNS:
+ match = re.search(pattern, text_upper)
+ if match:
+ amount = self._parse_decimal(match.group(1))
+ if amount and amount > 0 and amount < self.MAX_PAYMENT:
+ return (amount, confidence)
+
+ return (None, 0.0)
+
+ def extract_date(self, text: str) -> Tuple[Optional[date], float]:
+ """
+ Extract receipt date from text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ Tuple of (date, confidence) or (None, 0.0)
+ """
+ text_upper = text.upper()
+
+ # Try standard patterns first
+ for pattern, confidence in self.DATE_PATTERNS:
+ match = re.search(pattern, text_upper)
+ if match:
+ parsed = self._parse_date(match.group(1))
+ if parsed:
+ return (parsed, confidence)
+
+ # Try OCR-corrupted patterns with spaces
+ for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES:
+ match = re.search(pattern, text_upper)
+ if match:
+ try:
+ if fmt == 'ymd':
+ year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
+ else: # dmy
+ day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))
+
+ if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
+ return (date(year, month, day), confidence)
+ except (ValueError, TypeError):
+ continue
+
+ return (None, 0.0)
+
+ def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]:
+ """
+ Extract receipt number from text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ Tuple of (number, confidence) or (None, 0.0)
+ """
+ text_upper = text.upper()
+
+ for pattern, confidence in self.NUMBER_PATTERNS:
+ match = re.search(pattern, text_upper)
+ if match:
+ number = match.group(1).strip()
+ if number and len(number) >= 3:
+ return (number, confidence)
+
+ return (None, 0.0)
+
+ def extract_payment_methods(self, text: str) -> List[dict]:
+ """
+ Extract payment methods (CARD/NUMERAR) from receipt.
+
+ Supports multiple payments of the same type (e.g., 2x CARD for split payments).
+ Each payment is returned as a separate entry with its amount.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}]
+ Multiple entries of same method type are allowed for split payments.
+ """
+ text_upper = text.upper()
+ methods = []
+ # Track (method, amount) pairs to avoid exact duplicates from overlapping patterns
+ seen_entries = set()
+
+ for pattern, method, confidence in self.PAYMENT_PATTERNS:
+ for match in re.finditer(pattern, text_upper):
+ try:
+ amount = self._parse_decimal(match.group(1))
+ if amount and amount > 0 and amount < self.MAX_PAYMENT:
+ # Deduplicate by (method, amount) to avoid same entry from multiple patterns
+ # But allow different amounts for same method (split payments)
+ entry_key = (method, amount)
+ if entry_key not in seen_entries:
+ methods.append({
+ 'method': method,
+ 'amount': amount,
+ 'confidence': confidence
+ })
+ seen_entries.add(entry_key)
+ except (ValueError, InvalidOperation):
+ continue
+
+ return methods
+
+ def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]:
+ """
+ Extract client CUI from B2B receipts.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ Tuple of (cui, confidence) or (None, 0.0)
+ """
+ text_upper = text.upper()
+
+ # First check if there's a CLIENT section
+ has_client_section = any(
+ re.search(marker, text_upper, re.IGNORECASE)
+ for marker in self.CLIENT_MARKERS
+ )
+
+ if not has_client_section:
+ return (None, 0.0)
+
+ # Try to extract CUI
+ for pattern, confidence in self.CLIENT_CUI_PATTERNS:
+ match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE)
+ if match:
+ cui = match.group(1)
+ # Normalize: strip the RO prefix (OCR may read it as R0) before keeping digits only
+ cui_digits = re.sub(r'\D', '', re.sub(r'^R[O0]?', '', cui))
+ if 6 <= len(cui_digits) <= 10:
+ return (cui_digits, confidence)
+
+ return (None, 0.0)
+
+ def extract_client_name(self, text: str) -> Tuple[Optional[str], float]:
+ """
+ Extract client/buyer company name from B2B receipts.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ Tuple of (client_name, confidence) or (None, 0.0)
+ """
+ text_upper = text.upper()
+ lines = text.split('\n')
+
+ # First check if there's a CLIENT section
+ client_section_idx = None
+ for i, line in enumerate(lines):
+ line_upper = line.upper().strip()
+ if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS):
+ client_section_idx = i
+ break
+
+ if client_section_idx is None:
+ return (None, 0.0)
+
+ # Look for company name in CLIENT section
+ line = lines[client_section_idx].strip()
+ line_upper = line.upper()
+
+ # Strategy 1: Check if name is on same line after ":"
+ if ':' in line:
+ name_part = line.split(':', 1)[1].strip()
+ if name_part and len(name_part) >= 3:
+ # Skip if it looks like a CUI (RO followed by digits)
+ if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()):
+ pass # This is CUI, not name - continue to next strategy
+ else:
+ # Check for company indicators
+ name_upper = name_part.upper()
+ if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS):
+ return (self._clean_company_name(name_part), 0.95)
+ elif len(name_part) >= 5 and not name_part.isdigit():
+ return (self._clean_company_name(name_part), 0.80)
+
+ # Strategy 2: Check next line for company name
+ if client_section_idx + 1 < len(lines):
+ next_line = lines[client_section_idx + 1].strip()
+ next_upper = next_line.upper()
+
+ # Skip if it's a CUI/CIF line or looks like CUI
+ if not re.search(r'\bC\.?\s*[UI]\.?\s*[IF]\.?', next_upper):
+ if not re.match(r'^R[O0]?\d{6,10}$', next_upper):
+ if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS):
+ return (self._clean_company_name(next_line), 0.90)
+ elif len(next_line) >= 5 and not next_line.isdigit():
+ # Check it's not CUI/CIF/COD keywords
+ if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']):
+ return (self._clean_company_name(next_line), 0.75)
+
+ # Strategy 3: Look for any line with company indicators in CLIENT section region
+ search_end = min(client_section_idx + 5, len(lines))
+ for i in range(client_section_idx + 1, search_end):
+ line = lines[i].strip()
+ line_upper = line.upper()
+
+ # Skip CUI/CIF lines
+ if re.search(r'\bC\.?\s*[UI]\.?\s*[IF]\.?', line_upper):
+ continue
+ if re.match(r'^R[O0]?\d{6,10}$', line_upper):
+ continue
+
+ if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS):
+ return (self._clean_company_name(line), 0.85)
+
+ return (None, 0.0)
+
+ @staticmethod
+ def _clean_company_name(name: str) -> str:
+ """Clean company name for storage."""
+ if not name:
+ return ""
+ # Remove extra whitespace
+ name = re.sub(r'\s+', ' ', name).strip()
+ # Remove trailing commas, semicolons, and colons (periods are kept for S.R.L., S.A., etc.)
+ name = re.sub(r'[,;:]+$', '', name).strip()
+ return name
+
+ # -------------------------------------------------------------------------
+ # Validation hints - override to customize validation behavior
+ # -------------------------------------------------------------------------
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """
+ Return validation hints for this store.
+
+ Returns:
+ Dict with validation hints. Common keys:
+ - has_multi_rate_tva: bool - Store uses multiple TVA rates
+ - card_equals_total: bool - CARD payment equals total
+ - has_client_cui: bool - Receipt includes client CUI
+ - has_efactura: bool - Store uses e-factura format
+ - is_non_vat_payer: bool - Store is not a VAT payer
+ """
+ return {}
+
+ # -------------------------------------------------------------------------
+ # Helper methods - available to all subclasses
+ # -------------------------------------------------------------------------
+
+ @staticmethod
+ def _normalize_number(text: str) -> str:
+ """
+ Normalize a number string for Decimal conversion.
+
+ Handles Romanian formats: "1.234,56" -> "1234.56"
+ """
+ if not text:
+ return "0"
+
+ # Remove spaces
+ text = text.replace(" ", "")
+
+ # Determine decimal separator
+ last_comma = text.rfind(",")
+ last_dot = text.rfind(".")
+
+ if last_comma > last_dot:
+ text = text.replace(".", "").replace(",", ".")
+ elif last_dot > last_comma:
+ text = text.replace(",", "")
+ else:
+ text = text.replace(",", ".")
+
+ return text
+
+ @staticmethod
+ def _parse_decimal(text: str) -> Optional[Decimal]:
+ """Parse a string to Decimal, handling various formats."""
+ try:
+ normalized = BaseStoreProfile._normalize_number(text)
+ return Decimal(normalized)
+ except (InvalidOperation, ValueError, TypeError):
+ return None
+
+ @staticmethod
+ def _parse_date(text: str) -> Optional[date]:
+ """
+ Parse date string in various formats.
+
+ Supports: DD-MM-YYYY, DD/MM/YYYY, DD.MM.YYYY, YYYY-MM-DD
+ """
+ if not text:
+ return None
+
+ # Normalize separators
+ text = text.replace('/', '-').replace('.', '-')
+
+ try:
+ parts = text.split('-')
+ if len(parts) != 3:
+ return None
+
+ # Determine format based on first part length
+ if len(parts[0]) == 4:
+ # YYYY-MM-DD
+ year, month, day = int(parts[0]), int(parts[1]), int(parts[2])
+ else:
+ # DD-MM-YYYY
+ day, month, year = int(parts[0]), int(parts[1]), int(parts[2])
+
+ # Validate ranges
+ if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
+ return date(year, month, day)
+ except (ValueError, TypeError, IndexError):
+ pass
+
+ return None
+
+ @staticmethod
+ def _clean_text(text: str) -> str:
+ """Clean OCR text for pattern matching."""
+ if not text:
+ return ""
+ text = re.sub(r'\s+', ' ', text)
+ text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text)
+ return text.strip()
+
+ # -------------------------------------------------------------------------
+ # Magic methods
+ # -------------------------------------------------------------------------
+
+ def __repr__(self) -> str:
+ return f"<{self.__class__.__name__} CUI={self.CUI_LIST}>"
+
+ def __str__(self) -> str:
+ return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})"
diff --git a/backend/modules/data_entry/services/ocr/profiles/best_print.py b/backend/modules/data_entry/services/ocr/profiles/best_print.py
new file mode 100644
index 0000000..16cb427
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/best_print.py
@@ -0,0 +1,54 @@
+"""
+BEST PRINT TRADE ACTIV SRL store profile for OCR extraction.
+
+Stamp manufacturing service. Non-VAT payer (neplătitor de TVA).
+"""
+
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class BestPrintProfile(BaseStoreProfile):
+ """
+ BEST PRINT TRADE ACTIV SRL - non-VAT payer profile.
+
+ Key characteristics:
+ - Non-VAT payer (neplătitor de TVA) - NO TVA on receipts
+ - Stamp manufacturing and printing services
+ - Total amount has no TVA component
+ - CARD payment typical
+ """
+
+ CUI_LIST = ["45417955"]
+ NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"]
+ STORE_NAME = "BEST PRINT TRADE ACTIV SRL"
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries - returns empty for non-VAT payer.
+
+ BEST PRINT is a non-VAT payer (neplătitor de TVA),
+ so no TVA entries are expected on receipts.
+
+ Args:
+ text: Raw OCR text from receipt (unused)
+
+ Returns:
+ Empty list (non-VAT payer has no TVA)
+ """
+ # Non-VAT payer - no TVA entries
+ return []
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return BEST PRINT-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": True,
+ "has_client_cui": True, # May have client CUI
+ "has_efactura": False,
+ "is_non_vat_payer": True, # CRITICAL: Non-VAT payer
+ "tva_pattern": "none",
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/brick.py b/backend/modules/data_entry/services/ocr/profiles/brick.py
new file mode 100644
index 0000000..468d76a
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/brick.py
@@ -0,0 +1,101 @@
+"""
+BRICK (Five-Holding) store profile for OCR extraction.
+
+Five-Holding S.A. operates BRICK stores with standard receipt format.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class BrickProfile(BaseStoreProfile):
+ """
+ FIVE-HOLDING S.A. (BRICK) - standard TVA format.
+
+ Key characteristics:
+ - Standard TVA format
+ - Single TVA rate typically
+ - No client CUI on receipts
+ """
+
+ CUI_LIST = ["10562600"]
+ NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"] # OCR variants
+ STORE_NAME = "FIVE-HOLDING S.A."
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # Simple: "TVA XX% YY,YY"
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract BRICK-specific TVA entries.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return BRICK-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False,
+ "has_client_cui": False,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/dedeman.py b/backend/modules/data_entry/services/ocr/profiles/dedeman.py
new file mode 100644
index 0000000..9d6c97e
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/dedeman.py
@@ -0,0 +1,118 @@
+"""
+DEDEMAN store profile for OCR extraction.
+
+Dedeman receipts may include e-factura information and use standard TVA format.
+Large DIY retailer in Romania.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class DedemanProfile(BaseStoreProfile):
+ """
+ DEDEMAN SRL - standard TVA with e-factura support.
+
+ Key characteristics:
+ - Standard TVA format
+ - May include e-factura reference number
+ - Professional receipts for construction materials
+ """
+
+ CUI_LIST = ["2816464"]
+ NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"] # OCR variants
+ STORE_NAME = "DEDEMAN SRL"
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA (XX%) YY,YY"
+ r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)',
+ ]
+
+ # E-factura pattern for reference extraction
+ EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract Dedeman-specific TVA entries.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def extract_efactura_reference(self, text: str) -> str | None:
+ """
+ Extract e-factura reference number if present.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ E-factura reference string or None
+ """
+ match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE)
+ return match.group(1) if match else None
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return Dedeman-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False,
+ "has_client_cui": False,
+ "has_efactura": True,
+ "is_non_vat_payer": False,
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/electrobering.py b/backend/modules/data_entry/services/ocr/profiles/electrobering.py
new file mode 100644
index 0000000..ec08f59
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/electrobering.py
@@ -0,0 +1,102 @@
+"""
+ELECTROBERING S.R.L. store profile for OCR extraction.
+
+Electronics and home supplies store.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class ElectroberingProfile(BaseStoreProfile):
+ """
+ ELECTROBERING S.R.L. - standard TVA profile.
+
+ Key characteristics:
+ - Standard TVA format (single rate, any percentage)
+ - Electronics and home supplies
+ - May have client CUI for B2B purchases
+ - CARD payment typical
+ """
+
+ CUI_LIST = ["2744937"]
+ NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"]
+ STORE_NAME = "ELECTROBERING S.R.L."
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA XX% YY,YY" (simple format without code)
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return ELECTROBERING-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": True,
+ "has_client_cui": True, # May have client CUI for B2B
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/gama_ink.py b/backend/modules/data_entry/services/ocr/profiles/gama_ink.py
new file mode 100644
index 0000000..8fd4bff
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/gama_ink.py
@@ -0,0 +1,103 @@
+"""
+GAMA INK SERVICE SRL store profile for OCR extraction.
+
+Toner refill and printer supplies store.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class GamaInkProfile(BaseStoreProfile):
+ """
+ GAMA INK SERVICE SRL - standard TVA profile.
+
+ Key characteristics:
+ - Standard TVA format (single rate, any percentage)
+ - Service-based (toner refill, printer supplies)
+ - CARD payment typical
+ """
+
+ CUI_LIST = ["17741882"]
+ NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"]
+ STORE_NAME = "GAMA INK SERVICE SRL"
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA XX% YY,YY" (simple format without code)
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ # "TVA: YY,YY" (amount only; not yet used by extract_tva_entries below)
+ r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first (have both code and percent)
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format (percent + amount without code)
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return GAMA INK-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": True,
+ "has_client_cui": False,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
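
The coded-first, simple-fallback flow used by this profile (and repeated in several profiles below) can be sketched in isolation. This is a minimal sketch, not the project's implementation: the regexes are copied from the first three entries of `TVA_PATTERNS` above, while `parse_decimal` is a hypothetical stand-in for `BaseStoreProfile._parse_decimal`, whose real behaviour is not shown in this diff. It assumes comma-decimal amounts without thousands separators.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List, Optional

# Regexes copied from TVA_PATTERNS[:3] in the profile above.
CODED_PATTERNS = [
    r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
    r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
]
SIMPLE_PATTERN = r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)'


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Hypothetical stand-in for BaseStoreProfile._parse_decimal (assumption)."""
    cleaned = raw.strip().rstrip('.,').replace(',', '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_tva_entries(text: str) -> List[dict]:
    entries: List[dict] = []
    seen = set()
    # Coded formats first: they carry both the rate code and the percent.
    for pattern in CODED_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            code, percent = match.group(1).upper(), int(match.group(2))
            amount = parse_decimal(match.group(3))
            if amount and amount > 0 and (code, percent) not in seen:
                entries.append({'code': code, 'percent': percent, 'amount': amount})
                seen.add((code, percent))
    # Simple format only when nothing coded matched; the code defaults to 'A'.
    if not entries:
        for match in re.finditer(SIMPLE_PATTERN, text, re.IGNORECASE):
            amount = parse_decimal(match.group(2))
            if amount and amount > 0:
                entries.append({'code': 'A', 'percent': int(match.group(1)), 'amount': amount})
                break  # first valid match only
    return entries
```

Because both coded regexes can match the same receipt line, the `(code, percent)` key is what keeps a line such as `TVA A: 19% = 15,97` from being counted twice.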
diff --git a/backend/modules/data_entry/services/ocr/profiles/kineterra.py b/backend/modules/data_entry/services/ocr/profiles/kineterra.py
new file mode 100644
index 0000000..1d787e0
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/kineterra.py
@@ -0,0 +1,53 @@
+"""
+KINETERRA store profile for OCR extraction.
+
+Kineterra is a non-VAT payer (neplătitor de TVA).
+Receipts don't include TVA breakdown.
+"""
+
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class KineterraProfile(BaseStoreProfile):
+ """
+ KINETERRA CONCEPT SRL - non-VAT payer profile.
+
+ Key characteristics:
+ - Non-VAT payer (neplătitor de TVA)
+ - No TVA breakdown on receipts
+ - Total amount has no TVA component
+ """
+
+ CUI_LIST = ["31180432"]
+ NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"] # OCR variants
+ STORE_NAME = "KINETERRA CONCEPT SRL"
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries - returns empty for non-VAT payer.
+
+ Kineterra is a non-VAT payer, so no TVA entries are expected.
+
+ Args:
+ text: Raw OCR text from receipt (unused)
+
+ Returns:
+ Empty list (non-VAT payer has no TVA)
+ """
+ # Non-VAT payer - no TVA entries
+ return []
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return Kineterra-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False,
+ "has_client_cui": False,
+ "has_efactura": False,
+ "is_non_vat_payer": True,
+ "tva_pattern": "none",
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/lidl.py b/backend/modules/data_entry/services/ocr/profiles/lidl.py
new file mode 100644
index 0000000..bcb180b
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/lidl.py
@@ -0,0 +1,93 @@
+"""
+LIDL store profile for OCR extraction.
+
+Lidl receipts have a specific TVA format without hyphen/colon separators:
+ TOTAL TVA 9,84
+ TVA A 21,00% 7,71
+ TVA B 11,00% 2,13
+
+This profile handles multi-rate TVA extraction for Lidl receipts.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class LidlProfile(BaseStoreProfile):
+ """
+ LIDL DISCOUNT S.R.L. - multi-rate TVA profile.
+
+ Key characteristics:
+ - Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible)
+ - TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line)
+ - Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%)
+ - CARD payment usually equals total
+ - No client CUI on receipts
+ """
+
+ CUI_LIST = ["22891860"]
+ NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"] # OCR variants
+ STORE_NAME = "LIDL DISCOUNT S.R.L."
+
+ # Lidl-specific TVA patterns
+ # Format: "TVA A 21,00% 7,71" (code + percent + amount on same line)
+ TVA_PATTERNS = [
+ # Primary: "TVA A 21,00% 7.71" with various spacing
+ r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ # With backslash OCR artifact: "TVA A \21,00% 7.71"
+ r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ # IVA variant (rare OCR misread)
+ r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract Lidl-specific TVA entries.
+
+ Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts.
+ Uses deduplication to avoid counting the same entry twice from different patterns.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set() # Deduplication key: (code, percent)
+
+ for pattern in self.TVA_PATTERNS:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return Lidl-specific validation hints."""
+ return {
+ "has_multi_rate_tva": True,
+ "card_equals_total": True,
+ "has_client_cui": False,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
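
As a quick check of the Lidl patterns, the sample receipt lines from the module docstring above can be run through the same loop. This is a sketch under stated assumptions: the regexes are copied from `LidlProfile.TVA_PATTERNS`, but `extract_lidl_tva` and its inline comma-to-dot parsing are hypothetical simplifications standing in for `_parse_decimal`, which is not shown in this diff.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List

# Regexes copied from LidlProfile.TVA_PATTERNS above.
LIDL_TVA_PATTERNS = [
    r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
    r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
    r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
]

# Sample lines taken from the lidl.py module docstring.
SAMPLE = "TOTAL TVA 9,84\nTVA A 21,00% 7,71\nTVA B 11,00% 2,13\n"


def extract_lidl_tva(text: str) -> List[dict]:
    """Hypothetical standalone version of the extraction loop (assumption)."""
    entries: List[dict] = []
    seen = set()  # deduplication key: (code, percent)
    for pattern in LIDL_TVA_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            key = (match.group(1).upper(), int(match.group(2)))
            try:
                # Simplified parsing: comma decimal, no thousands separator.
                amount = Decimal(match.group(3).replace(',', '.'))
            except InvalidOperation:
                continue
            if amount > 0 and key not in seen:
                entries.append({'code': key[0], 'percent': key[1], 'amount': amount})
                seen.add(key)
    return entries


entries = extract_lidl_tva(SAMPLE)
tva_total = sum((e['amount'] for e in entries), Decimal('0'))
```

With the docstring sample, the two extracted amounts (7.71 and 2.13) add up to the 9.84 printed on the `TOTAL TVA` line, which is a cheap sanity check for multi-rate receipts.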
diff --git a/backend/modules/data_entry/services/ocr/profiles/omv.py b/backend/modules/data_entry/services/ocr/profiles/omv.py
new file mode 100644
index 0000000..9cbff98
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/omv.py
@@ -0,0 +1,99 @@
+"""
+OMV Petrom store profile for OCR extraction.
+
+OMV receipts typically include client CUI and use standard TVA format.
+Common at gas stations with fuel purchases.
+
+Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14")
+"""
+
+import re
+from datetime import date
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any, Tuple, Optional
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class OMVProfile(BaseStoreProfile):
+ """
+ OMV PETROM MARKETING S.R.L. - standard TVA with client CUI.
+
+ Key characteristics:
+ - Standard TVA format (usually single rate, any percentage)
+ - Includes client CUI on receipt (for business purchases)
+ - TVA table format: "A-XX,XX% base_amount tva_amount"
+ - Supports historical rates (19%) and current rates (21%)
+ - Date format: YYYY. MM. DD (with spaces)
+ """
+
+ CUI_LIST = ["11201891"]
+ NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"] # OCR variants
+ STORE_NAME = "OMV PETROM MARKETING S.R.L."
+
+ # OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva)
+ TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'
+
+ # Standard TVA pattern fallback
+ TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)'
+
+ # OMV specific: prioritize YYYY. MM. DD format with spaces
+ DATE_PATTERNS_OCR_SPACES = [
+ # YYYY. MM. DD with time (OMV format)
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
+ # Fallback to DD. MM. YYYY
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract OMV-specific TVA entries.
+
+ OMV receipts often show TVA in table format with base and TVA amounts.
+ Only the table format is parsed here; TVA_STANDARD_PATTERN is defined but not applied by this method.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try table format first (more accurate)
+ for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ # TVA amount is the second amount column (group 4); group 3 is the taxable base
+ tva_amount = self._parse_decimal(match.group(4))
+
+ if tva_amount and tva_amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': tva_amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return OMV-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False,
+ "has_client_cui": True,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ "tva_table_format": True,
+ }
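
The `DATE_PATTERNS_OCR_SPACES` tuples above pair a regex with a confidence and a field order. How `BaseStoreProfile` consumes them is not shown in this diff, so the following is only a sketch of one plausible reading: try the patterns in order and return the first one that yields a valid calendar date. The function name `extract_date_sketch` is hypothetical.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Tuples copied from OMVProfile.DATE_PATTERNS_OCR_SPACES above:
# (regex, confidence, field order).
DATE_PATTERNS_OCR_SPACES = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def extract_date_sketch(text: str) -> Tuple[Optional[date], float]:
    """Assumed consumption order: first pattern giving a valid date wins."""
    for pattern, confidence, order in DATE_PATTERNS_OCR_SPACES:
        match = re.search(pattern, text)
        if not match:
            continue
        first, second, third = (int(group) for group in match.groups())
        if order == 'ymd':
            year, month, day = first, second, third
        else:  # 'dmy'
            year, month, day = third, second, first
        try:
            return date(year, month, day), confidence
        except ValueError:
            continue  # e.g. month 13 from an OCR misread; try the next pattern
    return None, 0.0
```

Listing the year-first patterns before the day-first ones reflects the OMV format documented above (`2025. 08. 14`); a match that includes a time component is ranked higher because it is less likely to be an unrelated number sequence.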
diff --git a/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py b/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py
new file mode 100644
index 0000000..cf985f7
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py
@@ -0,0 +1,101 @@
+"""
+PICTUS VELUM SRL store profile for OCR extraction.
+
+Office supplies and stationery store.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class PictusVelumProfile(BaseStoreProfile):
+ """
+ PICTUS VELUM SRL - standard TVA profile.
+
+ Key characteristics:
+ - Standard TVA format (single rate, any percentage)
+ - Office supplies and stationery (rechizite)
+ - CARD payment typical
+ """
+
+ CUI_LIST = ["39634534"]
+ NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"]
+ STORE_NAME = "PICTUS VELUM SRL"
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA XX% YY,YY" (simple format without code)
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return PICTUS VELUM-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": True,
+ "has_client_cui": False,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/socar.py b/backend/modules/data_entry/services/ocr/profiles/socar.py
new file mode 100644
index 0000000..541c9a4
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/socar.py
@@ -0,0 +1,111 @@
+"""
+SOCAR Petroleum store profile for OCR extraction.
+
+SOCAR receipts are similar to OMV - gas station with client CUI support.
+Date format may use YYYY. MM. DD with spaces.
+"""
+
+import re
+from datetime import date
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any, Tuple, Optional
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class SocarProfile(BaseStoreProfile):
+ """
+ SOCAR PETROLEUM S.A. - standard TVA with client CUI.
+
+ Key characteristics:
+ - Standard TVA format (usually single rate)
+ - Includes client CUI on receipt (for business purchases)
+ - Similar format to OMV/Petrom
+ - Date format may use YYYY. MM. DD (with spaces)
+ """
+
+ CUI_LIST = ["12546600"]
+ NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"] # OCR variants
+ STORE_NAME = "SOCAR PETROLEUM S.A."
+
+ # Standard TVA patterns for gas stations
+ TVA_PATTERNS = [
+ # Table format: "A-19,00% 285,66 49,58"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)',
+ # Simple format: "TVA 19% 49,58"
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ ]
+
+ # Gas stations may use YYYY. MM. DD format
+ DATE_PATTERNS_OCR_SPACES = [
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
+ (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
+ (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract SOCAR-specific TVA entries.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try table format first
+ table_pattern = self.TVA_PATTERNS[0]
+ for match in re.finditer(table_pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ tva_amount = self._parse_decimal(match.group(4))
+
+ if tva_amount and tva_amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': tva_amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation):
+ continue
+
+ # Fallback to simple format if no table entries found
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[1]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ # Default to code 'A' for simple format
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break # Only take first match for simple format
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return SOCAR-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False,
+ "has_client_cui": True,
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/stepout_market.py b/backend/modules/data_entry/services/ocr/profiles/stepout_market.py
new file mode 100644
index 0000000..cda1b52
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/stepout_market.py
@@ -0,0 +1,112 @@
+"""
+STEPOUT MARKET SRL store profile for OCR extraction.
+
+Bookstore with reduced TVA rate (5% for books in Romania).
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class StepoutMarketProfile(BaseStoreProfile):
+ """
+ STEPOUT MARKET SRL - reduced TVA rate profile (books).
+
+ Key characteristics:
+ - Reduced TVA rate: 5% for books ("cărți") in Romania
+ - May also have standard rates for non-book items
+ - Patterns are flexible to accept ANY TVA rate
+ - CARD payment typical
+ """
+
+ CUI_LIST = ["35532655"]
+ NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"]
+ STORE_NAME = "STEPOUT MARKET SRL"
+
+ # TVA patterns (flexible - accepts any rate including 5%)
+ TVA_PATTERNS = [
+ # "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format)
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - 5,00% = YY,YY" (table format)
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA 5% YY,YY" (simple format - common for single rate)
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ # "TVA 5,00%: YY,YY" (percent with colon)
+ r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Stepout Market primarily sells books which have 5% TVA in Romania.
+ The patterns are generic and will extract whatever rate is on the receipt.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first (have code letter)
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format (no code letter, just percent + amount)
+ if not entries:
+ for pattern in self.TVA_PATTERNS[2:]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ # Default to code 'A' for simple format
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break # Only take first match for simple format
+ except (ValueError, InvalidOperation):
+ continue
+ if entries:
+ break
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return STEPOUT MARKET-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": True,
+ "has_client_cui": True, # May have client CUI
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ "typical_tva_rate": 5, # Books have 5% TVA in Romania
+ "product_category": "books",
+ }
diff --git a/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py b/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py
new file mode 100644
index 0000000..3c629ca
--- /dev/null
+++ b/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py
@@ -0,0 +1,103 @@
+"""
+UNLIMITED KEYS S.R.L. store profile for OCR extraction.
+
+Key duplication service. Notable for CASH (NUMERAR) payments.
+"""
+
+import re
+from decimal import Decimal, InvalidOperation
+from typing import List, Dict, Any
+
+from .base import BaseStoreProfile
+from . import ProfileRegistry
+
+
+@ProfileRegistry.register
+class UnlimitedKeysProfile(BaseStoreProfile):
+ """
+ UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment.
+
+ Key characteristics:
+ - Standard TVA format (single rate, any percentage)
+ - Key duplication service
+ - NUMERAR (cash) payment common - different from most stores!
+ - May also accept CARD
+ """
+
+ CUI_LIST = ["18993187"]
+ NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"]
+ STORE_NAME = "UNLIMITED KEYS S.R.L."
+
+ # Standard TVA patterns (flexible - accepts any rate)
+ TVA_PATTERNS = [
+ # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
+ r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
+ # "A - XX,XX% = YY,YY"
+ r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
+ # "TVA XX% YY,YY" (simple format without code)
+ r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
+ ]
+
+ def extract_tva_entries(self, text: str) -> List[dict]:
+ """
+ Extract TVA entries from receipt text.
+
+ Args:
+ text: Raw OCR text from receipt
+
+ Returns:
+ List of TVA entries with code, percent, and amount
+ """
+ entries = []
+ seen = set()
+
+ # Try coded patterns first
+ for pattern in self.TVA_PATTERNS[:2]:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount = self._parse_decimal(match.group(3))
+
+ if amount and amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ except (ValueError, InvalidOperation, IndexError):
+ continue
+
+ # Fallback to simple format
+ if not entries:
+ simple_pattern = self.TVA_PATTERNS[2]
+ for match in re.finditer(simple_pattern, text, re.IGNORECASE):
+ try:
+ percent = int(match.group(1))
+ amount = self._parse_decimal(match.group(2))
+
+ if amount and amount > 0:
+ entries.append({
+ 'code': 'A',
+ 'percent': percent,
+ 'amount': amount
+ })
+ break
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def get_validation_hints(self) -> Dict[str, Any]:
+ """Return UNLIMITED KEYS-specific validation hints."""
+ return {
+ "has_multi_rate_tva": False,
+ "card_equals_total": False, # May be NUMERAR (cash)
+ "has_client_cui": True, # May have client CUI
+ "has_efactura": False,
+ "is_non_vat_payer": False,
+ "common_payment": "NUMERAR", # Cash payments common
+ }
diff --git a/backend/modules/data_entry/services/ocr_extractor.py b/backend/modules/data_entry/services/ocr_extractor.py
index 5a87190..2667c7a 100644
--- a/backend/modules/data_entry/services/ocr_extractor.py
+++ b/backend/modules/data_entry/services/ocr_extractor.py
@@ -7,6 +7,7 @@ from typing import Optional, Tuple, List
from dataclasses import dataclass, field
from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
+from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
@dataclass
@@ -63,6 +64,57 @@ class ExtractionResult:
class ReceiptExtractor:
"""Extract receipt fields using pattern matching for Romanian receipts."""
+ # =========================================================================
+ # DEPRECATED: STORE_PROFILES dict - USE ProfileRegistry INSTEAD
+ # =========================================================================
+ # Store profiles are now managed by ProfileRegistry in:
+ # backend/modules/data_entry/services/ocr/profiles/
+ #
+ # This dict is kept for reference only. All extraction logic now uses:
+ # ProfileRegistry.get_profile(cui)
+ #
+ # See: backend/modules/data_entry/services/ocr/profiles/README.md
+ # =========================================================================
+ STORE_PROFILES = {
+ # Lidl - multi-rate TVA (A+B), specific format without hyphen/colon
+ "22891860": {
+ "name": "LIDL DISCOUNT S.R.L.",
+ "tva_pattern": "lidl",
+ "tva_format": "TVA {code} {percent}% {amount}",
+ "has_multi_rate_tva": True,
+ "card_equals_total": True,
+ },
+ # OMV Petrom - single TVA rate, client CUI included
+ "11201891": {
+ "name": "OMV PETROM MARKETING S.R.L.",
+ "tva_pattern": "standard",
+ "has_client_cui": True,
+ },
+ # FIVE-HOLDING (BRICK) - standard format
+ "10562600": {
+ "name": "FIVE-HOLDING S.A.",
+ "tva_pattern": "standard",
+ },
+ # Dedeman - e-factura format
+ "2816464": {
+ "name": "DEDEMAN SRL",
+ "tva_pattern": "standard",
+ "has_efactura": True,
+ },
+ # SOCAR Petroleum
+ "12546600": {
+ "name": "SOCAR PETROLEUM S.A.",
+ "tva_pattern": "standard",
+ "has_client_cui": True,
+ },
+ # Kineterra - non-VAT payer
+ "31180432": {
+ "name": "KINETERRA CONCEPT SRL",
+ "tva_pattern": "none",
+ "is_non_vat_payer": True,
+ },
+ }
+
# Total amount patterns (most specific first)
# Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc.
# OCR often produces errors, so patterns must be tolerant
@@ -394,48 +446,101 @@ class ReceiptExtractor:
result.raw_text = text
text_upper = text.upper()
- # Extract core fields
- result.amount, result.confidence_amount = self._extract_amount(text_upper)
- result.receipt_date, result.confidence_date = self._extract_date(text_upper)
- result.receipt_number, _ = self._extract_number(text_upper)
- result.receipt_series, _ = self._extract_series(text_upper)
+ # =========================================================================
+ # STEP 1: Extract vendor info FIRST to find store profile
+ # =========================================================================
result.partner_name, result.confidence_vendor = self._extract_vendor(text)
result.cui, _ = self._extract_cui(text_upper, text)
- # Normalize CUI: fix R0 → RO OCR error and validate format
result.cui = OCRValidationEngine.normalize_cui(result.cui)
- # Extract additional fields - Multiple TVA entries
- result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
+ # Lookup store-specific profile for enhanced extraction accuracy
+ store_profile = ProfileRegistry.get_profile(result.cui) if result.cui else None
+ if store_profile:
+ print(f"[Profile] Using {store_profile.__class__.__name__} for CUI {result.cui}", flush=True)
+
+ # =========================================================================
+ # STEP 2: Extract ALL fields using profile (if available) or generic
+ # =========================================================================
+ if store_profile:
+ # Profile-specific extraction (higher accuracy for known stores)
+ result.amount, result.confidence_amount = store_profile.extract_total(text_upper)
+ result.receipt_date, result.confidence_date = store_profile.extract_date(text_upper)
+ result.receipt_number, _ = store_profile.extract_receipt_number(text_upper)
+ result.tva_entries = store_profile.extract_tva_entries(text_upper)
+ result.tva_total = sum(e['amount'] for e in result.tva_entries) if result.tva_entries else None
+ result.payment_methods = store_profile.extract_payment_methods(text_upper)
+
+ # Client data extraction via profile (CUI + name)
+ profile_client_cui, cui_confidence = store_profile.extract_client_cui(text_upper)
+ profile_client_name, name_confidence = store_profile.extract_client_name(text)
+
+ if profile_client_cui or profile_client_name:
+ # Use profile extraction results
+ result.client_cui = OCRValidationEngine.normalize_cui(profile_client_cui) if profile_client_cui else None
+ result.client_name = profile_client_name
+ result.confidence_client = max(cui_confidence, name_confidence)
+ # Address still via generic (no profile method)
+ _, _, client_address, _ = self._extract_client_data(text_upper, text)
+ result.client_address = client_address
+ else:
+ # Fallback to generic client extraction
+ client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
+ result.client_name = client_name
+ result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
+ result.client_address = client_address
+ result.confidence_client = confidence
+
+ print(f"[Profile] Extracted: total={result.amount}, date={result.receipt_date}, "
+ f"TVA entries={len(result.tva_entries or [])}, payments={len(result.payment_methods or [])}", flush=True)
+ else:
+ # Generic extraction for unknown stores
+ result.amount, result.confidence_amount = self._extract_amount(text_upper)
+ result.receipt_date, result.confidence_date = self._extract_date(text_upper)
+ result.receipt_number, _ = self._extract_number(text_upper)
+ result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
+ result.payment_methods = self._extract_payment_methods(text_upper)
+
+ # Generic client extraction
+ client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
+ result.client_name = client_name
+ result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
+ result.client_address = client_address
+ result.confidence_client = confidence
+
+ # Series extraction (no profile method, always generic)
+ result.receipt_series, _ = self._extract_series(text_upper)
+
+ # =========================================================================
+ # STEP 3: Debug logging and validation
+ # =========================================================================
if not result.tva_entries:
print(f"[TVA Debug] No TVA found. Checking patterns...", flush=True)
- # Debug: show what patterns see
normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper)
taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE)
rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE)
print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True)
- # Log TVA vs TOTAL for debugging (validation happens in ocr_service._final_validation)
- # NOTE: We NO LONGER clear TVA here - the service will recalculate TOTAL from TVA if needed
+ # Log TVA vs TOTAL for debugging
if result.tva_total and result.amount:
if result.tva_total > result.amount:
print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True)
elif result.tva_total > result.amount * Decimal('0.5'):
print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True)
+ # Additional generic extractions
result.items_count = self._extract_items_count(text_upper)
result.address = self._extract_address(text_upper)
- result.payment_methods = self._extract_payment_methods(text_upper)
- # Validate payment methods against extracted amount
- # If payment sum >> amount, clear invalid payments (likely OCR error)
+ # =========================================================================
+ # STEP 4: Validate and post-process
+ # =========================================================================
# Save original payment methods before validation (for payment mode detection)
original_payment_methods = result.payment_methods.copy() if result.payment_methods else []
+ # Validate payment methods against extracted amount
result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount)
# Auto-suggest payment_mode based on detected payment methods
- # Use ORIGINAL payment_methods to detect CARD even if validation cleared them
- # (e.g., CARD 318.16 is valid even if total validation failed)
payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods
if payment_methods_for_mode:
card_amount = sum(
@@ -447,17 +552,9 @@ class ReceiptExtractor:
result.suggested_payment_mode = 'banca'
print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True)
else:
- # Only cash payments detected
result.suggested_payment_mode = 'numerar'
print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True)
- # Extract client data (B2B receipts)
- client_name, client_cui, client_address, confidence_client = self._extract_client_data(text_upper, text)
- result.client_name = client_name
- result.client_cui = OCRValidationEngine.normalize_cui(client_cui) # Fix R0 → RO OCR error
- result.client_address = client_address
- result.confidence_client = confidence_client
-
# Detect receipt type
result.receipt_type = self._detect_receipt_type(text_upper)
@@ -620,6 +717,40 @@ class ReceiptExtractor:
return num_str
+ def _calculate_multi_rate_tva_total(self, tva_entries: List[dict]) -> Optional[Decimal]:
+ """
+ Calculate implied total from ALL TVA entries (multi-rate support).
+
+ Formula for each entry: total_for_entry = tva * (100 + rate) / rate
+ Final total = sum of all entry totals
+
+ Example for Lidl (TVA A 21% = 7.71, TVA B 11% = 2.13):
+ Entry A: 7.71 * 121 / 21 = 44.42
+ Entry B: 2.13 * 111 / 11 = 21.49
+ Total: 44.42 + 21.49 = 65.92 ≈ 65.86 (within tolerance)
+
+ Returns:
+ Implied total Decimal, or None if calculation not possible
+ """
+ if not tva_entries:
+ return None
+
+ total = Decimal('0')
+ for entry in tva_entries:
+ rate = entry.get('percent', 0)
+ tva_amount = entry.get('amount')
+ if tva_amount and rate > 0:
+ try:
+ tva_dec = Decimal(str(tva_amount))
+ # Formula: total_for_entry = tva * (100 + rate) / rate
+ entry_total = tva_dec * Decimal(100 + rate) / Decimal(rate)
+ total += entry_total
+ print(f"[Multi-rate TVA] Entry {entry.get('code', '?')}: tva={tva_amount}, rate={rate}% -> implied={entry_total:.2f}", flush=True)
+ except (InvalidOperation, ValueError, TypeError):
+ continue
+
+ return total.quantize(Decimal('0.01')) if total > 0 else None
+
def _cross_validate_and_calculate_amount(
self,
amount: Optional[Decimal],
@@ -634,12 +765,11 @@ class ReceiptExtractor:
Returns: (amount, confidence, source_description)
Logic:
- 1. If amount is valid (>0) with high confidence (>=0.8), use it directly
- 2. Calculate payment_sum = CARD + NUMERAR + other methods
- 3. Calculate tva_implied_total = tva_total * (100 + rate) / rate
- 4. Cross-validate: if payment_sum matches extracted amount, boost confidence
- 5. If amount is 0/None, use payment_sum as total
- 6. If payment_sum is 0, try to calculate from TVA
+ 1. Collect all available sources: extracted amount, payment sum, TVA-implied total
+ 2. Find consensus: 2+ sources within 3% tolerance
+ 3. If consensus found, use the higher-confidence source value
+ 4. If extracted differs >10% from all others, it's an outlier - correct it
+ 5. If no consensus possible, fallback to individual validations
"""
# Calculate payment methods sum
payment_sum = Decimal('0')
@@ -652,43 +782,73 @@ class ReceiptExtractor:
except (InvalidOperation, ValueError, TypeError):
continue
- # Calculate TVA-implied total: total = tva * (100 + rate) / rate
- tva_implied_total = None
- if tva_entries:
- # Use the main TVA entry (typically the largest or first one)
- main_entry = tva_entries[0]
- rate = main_entry.get('percent', 19)
- tva_amount = main_entry.get('amount')
- if tva_amount and rate > 0:
- try:
- tva_dec = Decimal(str(tva_amount))
- # total = tva * (100 + rate) / rate
- tva_implied_total = (tva_dec * Decimal(100 + rate) / Decimal(rate)).quantize(Decimal('0.01'))
- except (InvalidOperation, ValueError, TypeError):
- pass
+ # Calculate TVA-implied total using ALL entries (multi-rate fix)
+ tva_implied_total = self._calculate_multi_rate_tva_total(tva_entries)
- # Case 1: Amount is valid with high confidence - validate against TVA and payments
+ # Multi-source consensus approach (3% tolerance for multi-rate TVA rounding)
+ CONSENSUS_TOLERANCE = 3.0 # 3% tolerance
+
+ # Collect all available sources with their confidences
+ sources = []
+ if amount and amount > 0:
+ sources.append(('extracted', float(amount), confidence_amount))
+ if payment_sum > 0:
+ sources.append(('payment', float(payment_sum), 0.92)) # Payment is very reliable
+ if tva_implied_total and tva_implied_total > 0:
+ sources.append(('tva_calc', float(tva_implied_total), 0.88)) # TVA calc is reliable
+
+ print(f"[Cross-Validation] Sources: {[(s[0], f'{s[1]:.2f}', f'{s[2]:.2f}') for s in sources]}", flush=True)
+
+ # Find consensus: 2+ sources within tolerance
+ if len(sources) >= 2:
+ for i, (name1, val1, conf1) in enumerate(sources):
+ for name2, val2, conf2 in sources[i+1:]:
+ if val1 <= 0 or val2 <= 0:
+ continue
+ diff_pct = abs(val1 - val2) / max(val1, val2) * 100
+ if diff_pct <= CONSENSUS_TOLERANCE:
+ # Consensus found! Use value from higher-confidence source
+ if conf1 >= conf2:
+ consensus_val, consensus_conf = val1, conf1
+ else:
+ consensus_val, consensus_conf = val2, conf2
+ # Boost confidence for consensus
+ consensus_conf = min(0.98, consensus_conf + 0.05)
+ print(f"[Cross-Validation] Consensus: {name1}={val1:.2f} ≈ {name2}={val2:.2f} (diff={diff_pct:.1f}%)", flush=True)
+ return Decimal(str(round(consensus_val, 2))), consensus_conf, f"consensus ({name1}+{name2})"
+
+ # No consensus - check if extracted is an outlier (differs >10% from all others)
+ if amount and amount > 0 and len(sources) >= 2:
+ other_sources = [s for s in sources if s[0] != 'extracted']
+ if other_sources:
+ extracted_val = float(amount)
+ all_differ = all(
+ abs(extracted_val - s[1]) / max(extracted_val, s[1]) * 100 > 10
+ for s in other_sources if s[1] > 0
+ )
+ if all_differ:
+ # Extracted differs significantly from all others - use the best other source
+ best_other = max(other_sources, key=lambda s: s[2])
+ print(f"[Cross-Validation] Extracted outlier: {extracted_val:.2f} differs >10% from all others, using {best_other[0]}={best_other[1]:.2f}", flush=True)
+ return Decimal(str(round(best_other[1], 2))), best_other[2], f"corrected (extracted outlier, using {best_other[0]})"
+
+ # Fallback: Case 1 - Amount valid with high confidence
if amount and amount > 0 and confidence_amount >= 0.8:
- # First check TVA-implied total (most reliable when TVA is extracted correctly)
+ # Check TVA-implied total
if tva_implied_total and tva_implied_total > 0:
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
- if tva_diff_percent <= 1:
- # Near-perfect TVA match - highest confidence
+ if tva_diff_percent <= 3:
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)"
elif tva_diff_percent > 10:
- # Significant mismatch - TVA-implied total is more reliable
- # This catches cases where wrong TOTAL line was extracted (e.g., REST, SUBTOTAL)
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)"
# Cross-validate with payment methods
if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'):
- # Perfect match - boost confidence
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)"
elif payment_sum > 0:
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
if payment_diff_percent > 10:
- # Significant mismatch - payment sum is more reliable
print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True)
return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)"
@@ -696,29 +856,22 @@ class ReceiptExtractor:
# Case 2: Amount exists but low confidence - try to validate/correct
if amount and amount > 0:
- # First check TVA-implied total (most reliable)
if tva_implied_total and tva_implied_total > 0:
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
- if tva_diff_percent <= 2:
- # Close match - boost confidence
+ if tva_diff_percent <= 3:
return amount, 0.88, "extracted (validated by TVA)"
elif tva_diff_percent > 10:
- # Significant mismatch - use TVA-implied total
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
return tva_implied_total, 0.85, "calculated from TVA"
- # Check if payment methods sum matches
if payment_sum > 0:
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
- if payment_diff_percent <= 0.5:
- # Close match - boost confidence
+ if payment_diff_percent <= 1:
return amount, 0.90, "extracted (validated by payment methods)"
elif payment_diff_percent > 10:
- # Mismatch - prefer payment_sum as it's more reliable
print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True)
return payment_sum, 0.85, "calculated from payment methods"
- # No validation possible - return as-is
return amount, confidence_amount, "extracted (unvalidated)"
# Case 3: Amount is 0 or None - calculate from payment methods
@@ -946,6 +1099,28 @@ class ReceiptExtractor:
return name
+ def _get_store_profile(self, cui: Optional[str]) -> Optional[dict]:
+ """
+ Get store-specific profile by CUI.
+
+ DEPRECATED: Use ProfileRegistry.get_profile() directly for profile objects.
+ This method is kept for backward compatibility and returns validation hints dict.
+
+ Args:
+ cui: The CUI extracted from receipt (with or without RO prefix)
+
+ Returns:
+ Store profile validation hints dict or None if not found
+ """
+ profile = ProfileRegistry.get_profile(cui)
+ if profile:
+ # Return validation hints for backward compatibility
+ hints = profile.get_validation_hints()
+ hints['name'] = profile.STORE_NAME
+ print(f"[Store Profile] Found profile for {cui}: {profile.STORE_NAME}", flush=True)
+ return hints
+ return None
+
def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]:
"""
Extract vendor CUI (fiscal identification code) from text.
@@ -1020,11 +1195,114 @@ class ReceiptExtractor:
# Default to bon_fiscal if neither found
return 'bon_fiscal'
+ def _try_pattern_lidl(self, text: str) -> List[dict]:
+ """
+ Try Lidl-style TVA pattern: "TVA A 21,00% 7.71" (no hyphen/colon separator).
+
+ Lidl receipts format:
+ TOTAL TVA 9,84
+ TVA A 21,00% 7,71
+ TVA B 11,00% 2,13
+
+ Returns list of TVA entries found.
+ """
+ entries = []
+ seen = set()
+
+ # Pattern: TVA/TUA/IVA + code (A-D) + percent + amount (on same line)
+ # Handles: "TVA A 21,00% 7,71", "TVA B 11,00% 2,13", "TUA A 21% 7.71"
+ lidl_patterns = [
+ # Same line: "TVA A 21,00% 7.71" (with various spacing)
+ r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ # Same line with backslash (OCR artifact): "TVA A \21,00% 7.71"
+ r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ # IVA variant
+ r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
+ ]
+
+ for pattern in lidl_patterns:
+ for match in re.finditer(pattern, text, re.IGNORECASE):
+ try:
+ code = match.group(1).upper()
+ percent = int(match.group(2))
+ amount_str = self._normalize_number(match.group(3))
+ amount = Decimal(amount_str)
+
+ if amount > 0:
+ entry_key = (code, percent)
+ if entry_key not in seen:
+ entries.append({
+ 'code': code,
+ 'percent': percent,
+ 'amount': amount
+ })
+ seen.add(entry_key)
+ print(f"[TVA Lidl] Found: TVA {code} {percent}% = {amount}", flush=True)
+ except (ValueError, InvalidOperation):
+ continue
+
+ return entries
+
+ def _select_best_tva_candidate(
+ self,
+ candidates: List[tuple],
+ tva_bon_total: Optional[Decimal]
+ ) -> Tuple[List[dict], Optional[Decimal]]:
+ """
+ Select the best TVA candidate from collected candidates.
+
+ Selection criteria (priority order):
+ 1. Sum matches TOTAL TVA BON (highest priority)
+ 2. More entries = better (for multi-rate receipts)
+ 3. Pattern confidence as tiebreaker
+
+ Args:
+ candidates: List of (pattern_name, confidence, entries, sum)
+ tva_bon_total: Authoritative TOTAL TVA BON value (if extracted)
+
+ Returns:
+ (best_entries, best_sum)
+ """
+ if not candidates:
+ return [], None
+
+ # Score each candidate
+ scored = []
+ for name, confidence, entries, sum_val in candidates:
+ score = 0.0
+
+ # Criterion 1: Sum matches TOTAL TVA BON (highest priority)
+ if tva_bon_total and sum_val:
+ tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02')) # 2% tolerance
+ if abs(sum_val - tva_bon_total) <= tolerance:
+ score += 100 # High bonus for matching authoritative total
+ print(f"[TVA Select] {name}: sum {sum_val} matches tva_bon_total {tva_bon_total}", flush=True)
+
+ # Criterion 2: More entries (for multi-rate receipts)
+ score += len(entries) * 10
+
+ # Criterion 3: Pattern confidence
+ score += confidence * 5
+
+ scored.append((score, name, confidence, entries, sum_val))
+ print(f"[TVA Select] Candidate {name}: score={score:.1f}, entries={len(entries)}, sum={sum_val}", flush=True)
+
+ # Sort by score descending
+ scored.sort(key=lambda x: x[0], reverse=True)
+ best = scored[0]
+ print(f"[TVA Select] Winner: {best[1]} (score={best[0]:.1f})", flush=True)
+
+ return best[3], best[4]
+
def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]:
"""
Extract multiple TVA (VAT) entries from text.
Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%).
+ Uses CANDIDATE COLLECTION approach:
+ - Try ALL patterns and collect candidates
+ - Select best candidate based on matching TOTAL TVA BON
+
Returns (tva_entries, tva_total) where tva_entries is a list of:
{'code': 'A', 'percent': 19, 'amount': Decimal('15.20')}
"""
@@ -1054,6 +1332,22 @@ class ReceiptExtractor:
# Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%")
normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text)
+ # Extract TOTAL TVA BON/TOTAL TVA first as the authoritative reference
+ tva_bon_total = self._extract_total_tva_bon(normalized_text)
+ print(f"[TVA Debug] TOTAL TVA BON: {tva_bon_total}", flush=True)
+
+ # CANDIDATE COLLECTION APPROACH: Try all patterns, collect candidates, select best
+ all_candidates = [] # List of (pattern_name, confidence, entries, sum)
+
+ # === LIDL-STYLE PATTERNS (NEW) ===
+ # Lidl format: "TVA A 21,00% 7.71" or "TVA B 11,00% 2.13" (no hyphen/colon)
+ # This pattern handles multi-rate TVA receipts
+ lidl_entries = self._try_pattern_lidl(normalized_text)
+ if lidl_entries:
+ lidl_sum = sum(e['amount'] for e in lidl_entries)
+ all_candidates.append(('lidl', 0.96, lidl_entries, lidl_sum))
+ print(f"[TVA Debug] Lidl pattern: {len(lidl_entries)} entries, sum={lidl_sum}", flush=True)
+
# Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable
# Format: "TOTAL TAXE: 55,22" - this is always the TVA amount
# OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:"
@@ -1372,10 +1666,21 @@ class ReceiptExtractor:
except (ValueError, InvalidOperation):
continue
- # Extract TOTAL TVA BON as reference (separate from individual entries)
- tva_bon_total = self._extract_total_tva_bon(normalized_text)
+ # Add existing extraction results to candidates (if any)
+ if tva_entries:
+ entries_sum = sum(entry['amount'] for entry in tva_entries)
+ all_candidates.append(('standard', 0.90, tva_entries, entries_sum))
+ print(f"[TVA Debug] Standard patterns: {len(tva_entries)} entries, sum={entries_sum}", flush=True)
- # Calculate sum from entries
+ # === CANDIDATE SELECTION ===
+ # Select best candidate using TOTAL TVA BON as authoritative reference
+ if all_candidates:
+ best_entries, best_sum = self._select_best_tva_candidate(all_candidates, tva_bon_total)
+ if best_entries:
+ tva_entries = best_entries
+ entries_sum = best_sum
+
+ # Recalculate sum from the final entries (overwrites the value set by candidate selection)
entries_sum = None
if tva_entries:
entries_sum = sum(entry['amount'] for entry in tva_entries)
diff --git a/scripts/generate_store_profile.py b/scripts/generate_store_profile.py
new file mode 100755
index 0000000..86f37a4
--- /dev/null
+++ b/scripts/generate_store_profile.py
@@ -0,0 +1,600 @@
+#!/usr/bin/env python3
+"""
+Store Profile Generator Script
+
+Analyzes PDF receipts from a store and generates a Python profile class
+for the OCR extraction system.
+
+Usage:
+ python scripts/generate_store_profile.py \
+ --name "Magazin Exemplu" \
+ --cui "12345678" \
+ --receipts "docs/data-entry/MagazinExemplu*.pdf" \
+ --output "backend/modules/data_entry/services/ocr/profiles/magazin_exemplu.py"
+
+Features:
+ - Submits PDFs to OCR API
+ - Analyzes extracted text for patterns (TVA, total, date, payment)
+ - Generates a BaseStoreProfile subclass with detected patterns
+ - Supports hot-reload via ProfileRegistry
+
+Requirements:
+ - Backend server running on localhost:8000
+ - JWT authentication
+ - python-jose, requests packages
+"""
+
+import argparse
+import glob
+import json
+import os
+import re
+import sys
+import time
+from collections import Counter, defaultdict
+from datetime import datetime, timedelta, timezone
+from pathlib import Path
+from typing import Dict, List, Optional, Tuple
+
+try:
+ import requests
+ from jose import jwt
+except ImportError:
+ print("Error: Required packages not installed.")
+ print("Run: pip install python-jose requests")
+ sys.exit(1)
+
+
+# Configuration
+API_BASE = os.getenv("API_BASE", "http://localhost:8000")
+JWT_SECRET = os.getenv("JWT_SECRET_KEY", "GENERATE_NEW_SECRET_FOR_PRODUCTION3334!")
+
+
+def create_jwt_token() -> str:
+ """Create a test JWT token for API authentication."""
+ payload = {
+ "username": "PROFILE_GENERATOR",
+ "user_id": 1,
+ "companies": ["604"],
+ "permissions": ["read", "write"],
+ "exp": datetime.now(timezone.utc) + timedelta(hours=1),
+ "iat": datetime.now(timezone.utc),
+ "type": "access"
+ }
+ return jwt.encode(payload, JWT_SECRET, algorithm="HS256")
+
+
+def submit_ocr(pdf_path: str, token: str, api_base: str = API_BASE, timeout: int = 120) -> Optional[Dict]:
+ """
+ Submit a PDF to OCR API and wait for result.
+
+ Args:
+ pdf_path: Path to PDF file
+ token: JWT authentication token
+ api_base: API base URL
+ timeout: Max seconds to wait for completion
+
+ Returns:
+ Extraction result dict or None on failure
+ """
+ headers = {"Authorization": f"Bearer {token}"}
+ filename = os.path.basename(pdf_path)
+
+ print(f" Submitting: {filename}...", end=" ", flush=True)
+
+ try:
+ with open(pdf_path, "rb") as f:
+ files = {"file": (filename, f, "application/pdf")}
+ response = requests.post(
+ f"{api_base}/api/data-entry/ocr/extract?engine=doctr_plus",
+ files=files,
+ headers=headers,
+ timeout=30
+ )
+
+ if response.status_code != 200:
+ print(f"FAILED (HTTP {response.status_code})")
+ return None
+
+ job_data = response.json()
+ job_id = job_data.get("job_id")
+
+ if not job_id:
+ print("FAILED (no job_id)")
+ return None
+
+ # Poll for completion
+ start_time = time.time()
+ while time.time() - start_time < timeout:
+ poll_response = requests.get(
+ f"{api_base}/api/data-entry/ocr/jobs/{job_id}/wait?timeout=30",
+ headers=headers,
+ timeout=35
+ )
+
+ if poll_response.status_code == 200:
+ job_result = poll_response.json()
+ status = job_result.get("status")
+
+ if status == "completed":
+ elapsed = time.time() - start_time
+ print(f"OK ({elapsed:.1f}s)")
+ return job_result.get("result", {})
+ elif status == "error":
+ print(f"ERROR: {job_result.get('error', 'Unknown')}")
+ return None
+
+ time.sleep(2)
+
+ print("TIMEOUT")
+ return None
+
+ except Exception as e:
+ print(f"EXCEPTION: {e}")
+ return None
+
+
+def analyze_tva_patterns(results: List[Dict]) -> Dict:
+ """
+ Analyze TVA patterns from multiple extraction results.
+
+ Returns:
+ Dict with detected patterns and statistics
+ """
+ tva_entries = []
+ raw_texts = []
+
+ for r in results:
+ if r.get("tva_entries"):
+ tva_entries.extend(r["tva_entries"])
+ if r.get("raw_text"):
+ raw_texts.append(r["raw_text"])
+
+ # Analyze TVA code patterns (A, B, C, etc.)
+ codes = Counter(e.get("code") for e in tva_entries if e.get("code"))
+
+ # Analyze TVA percentage patterns
+ percents = Counter(e.get("percent") for e in tva_entries if e.get("percent"))
+
+ # Detect TVA format from raw text
+ tva_formats = defaultdict(int)
+ for text in raw_texts:
+ text_upper = text.upper()
+
+ # Standard format: "TVA 19% 10.50" or "TVA: 19% 10.50"
+ if re.search(r'TVA\s*:?\s*\d{1,2}%', text_upper):
+ tva_formats["standard"] += 1
+
+ # Lidl format: "TVA A 21% 7.71"
+ if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
+ tva_formats["lidl_multi_rate"] += 1
+
+ # Table format: "BAZA TVA | % TVA | VALOARE TVA"
+ if re.search(r'BAZA\s+TVA', text_upper):
+ tva_formats["table"] += 1
+
+ # No TVA (neplatitor)
+ if re.search(r'NEPLATITOR|NON.?TVA', text_upper):
+ tva_formats["non_vat"] += 1
+
+ return {
+ "codes": dict(codes),
+ "percents": dict(percents),
+ "formats": dict(tva_formats),
+ "has_multi_rate": len(codes) > 1,
+ "is_non_vat": tva_formats.get("non_vat", 0) > 0,
+ "dominant_format": max(tva_formats, key=tva_formats.get) if tva_formats else "standard"
+ }
+
+
+def analyze_total_patterns(results: List[Dict]) -> Dict:
+ """Analyze TOTAL patterns from extraction results."""
+ totals = []
+ raw_texts = []
+
+ for r in results:
+ if r.get("amount"):
+ totals.append(float(r["amount"]))
+ if r.get("raw_text"):
+ raw_texts.append(r["raw_text"])
+
+ total_formats = defaultdict(int)
+ for text in raw_texts:
+ text_upper = text.upper()
+
+ if re.search(r'TOTAL\s*:?\s*[\d.,]+', text_upper):
+ total_formats["TOTAL:"] += 1
+ if re.search(r'TOTAL\s+DE\s+PLAT', text_upper):
+ total_formats["TOTAL DE PLATA"] += 1
+ if re.search(r'SUMA\s+TOTAL', text_upper):
+ total_formats["SUMA TOTALA"] += 1
+ if re.search(r'GRAND\s*TOTAL', text_upper):
+ total_formats["GRAND TOTAL"] += 1
+
+ return {
+ "count": len(totals),
+ "formats": dict(total_formats),
+ "dominant_format": max(total_formats, key=total_formats.get) if total_formats else "TOTAL"
+ }
+
+
+def analyze_date_patterns(results: List[Dict]) -> Dict:
+ """Analyze date patterns from extraction results."""
+ dates = []
+ raw_texts = []
+
+ for r in results:
+ if r.get("receipt_date"):
+ dates.append(r["receipt_date"])
+ if r.get("raw_text"):
+ raw_texts.append(r["raw_text"])
+
+ date_formats = defaultdict(int)
+ for text in raw_texts:
+ # DD.MM.YYYY
+ if re.search(r'\d{2}\.\d{2}\.\d{4}', text):
+ date_formats["DD.MM.YYYY"] += 1
+ # YYYY.MM.DD (OMV/SOCAR style)
+ if re.search(r'\d{4}\.\d{2}\.\d{2}', text):
+ date_formats["YYYY.MM.DD"] += 1
+ # DD-MM-YYYY
+ if re.search(r'\d{2}-\d{2}-\d{4}', text):
+ date_formats["DD-MM-YYYY"] += 1
+ # DD/MM/YYYY
+ if re.search(r'\d{2}/\d{2}/\d{4}', text):
+ date_formats["DD/MM/YYYY"] += 1
+
+ return {
+ "extracted_dates": dates,
+ "formats": dict(date_formats),
+ "dominant_format": max(date_formats, key=date_formats.get) if date_formats else "DD.MM.YYYY"
+ }
+
+
+def analyze_payment_patterns(results: List[Dict]) -> Dict:
+ """Analyze payment method patterns."""
+ payment_counts = defaultdict(int)
+
+ for r in results:
+ methods = r.get("payment_methods", [])
+ for m in methods:
+ method_type = m.get("method", "UNKNOWN")
+ payment_counts[method_type] += 1
+
+ return {
+ "methods": dict(payment_counts),
+ "has_mixed_payments": len(payment_counts) > 1
+ }
+
+
+def analyze_client_patterns(results: List[Dict]) -> Dict:
+ """Analyze client (B2B) patterns."""
+ has_client_cui = 0
+ has_client_name = 0
+
+ for r in results:
+ if r.get("client_cui"):
+ has_client_cui += 1
+ if r.get("client_name"):
+ has_client_name += 1
+
+ return {
+ "has_client_cui": has_client_cui > 0,
+ "has_client_name": has_client_name > 0,
+ "b2b_ratio": has_client_cui / len(results) if results else 0
+ }
+
+
+def generate_profile_code(
+ store_name: str,
+ cui: str,
+ tva_analysis: Dict,
+ total_analysis: Dict,
+ date_analysis: Dict,
+ payment_analysis: Dict,
+ client_analysis: Dict
+) -> str:
+ """
+ Generate Python profile class code.
+
+ Args:
+ store_name: Human-readable store name
+ cui: CUI number (without RO prefix)
+ *_analysis: Analysis results from pattern detection
+
+ Returns:
+ Python source code for the profile class
+ """
+ # Generate class name from store name
+ class_name = "".join(
+ word.capitalize()
+ for word in re.sub(r'[^a-zA-Z0-9\s]', '', store_name).split()
+ ) + "Profile"
+
+ # Generate module name
+ module_name = re.sub(r'[^a-z0-9]', '_', store_name.lower()).strip('_')
+
+ # Determine profile characteristics
+ is_non_vat = tva_analysis.get("is_non_vat", False)
+ has_multi_rate = tva_analysis.get("has_multi_rate", False)
+ has_client_cui = client_analysis.get("has_client_cui", False)
+ uses_yyyy_mm_dd = date_analysis.get("dominant_format") == "YYYY.MM.DD"
+
+ # Generate OCR name patterns
+ name_words = store_name.upper().split()
+ primary_word = name_words[0] if name_words else store_name.upper()
+ name_patterns = [
+ primary_word,
+ store_name.upper().replace(".", "").replace(",", ""),
+ ]
+ # Add OCR error variants
+ ocr_variants = {
+ 'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3'
+ }
+ for char, replacement in ocr_variants.items():
+ if char in primary_word:
+ name_patterns.append(primary_word.replace(char, replacement, 1))
+
+ name_patterns = list(dict.fromkeys(name_patterns))[:4] # Unique, max 4
+
+ # Build the code
+ code_lines = [
+ '"""',
+ f'{store_name} store profile for OCR extraction.',
+ '',
+ 'Auto-generated by generate_store_profile.py',
+ f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}',
+ '"""',
+ '',
+ 'import re',
+ 'from decimal import Decimal, InvalidOperation',
+ 'from typing import List, Dict, Any',
+ '',
+ 'from .base import BaseStoreProfile',
+ 'from . import ProfileRegistry',
+ '',
+ '',
+ '@ProfileRegistry.register',
+ f'class {class_name}(BaseStoreProfile):',
+ ' """',
+ f' {store_name} - OCR extraction profile.',
+ ' ',
+ ]
+
+ # Add characteristics to docstring
+ characteristics = []
+ if is_non_vat:
+ characteristics.append("Non-VAT payer (neplatitor TVA)")
+ if has_multi_rate:
+ characteristics.append("Multi-rate TVA")
+ if has_client_cui:
+ characteristics.append("B2B receipts with client CUI")
+ if uses_yyyy_mm_dd:
+ characteristics.append("Date format: YYYY.MM.DD")
+
+ if characteristics:
+ code_lines.append(' Key characteristics:')
+ for c in characteristics:
+ code_lines.append(f' - {c}')
+ code_lines.append(' ')
+
+ code_lines.extend([
+ ' """',
+ '',
+ f' CUI_LIST = ["{cui}"]',
+ f' NAME_PATTERNS = {name_patterns}',
+ f' STORE_NAME = "{store_name}"',
+ '',
+ ])
+
+ # Add date patterns override for YYYY.MM.DD format
+ if uses_yyyy_mm_dd:
+ code_lines.extend([
+ ' # Override date patterns for YYYY.MM.DD format',
+ ' DATE_PATTERNS_OCR_SPACES = [',
+ ' r\'(\\d{4})[.,]\\s*(\\d{2})[.,]\\s*(\\d{2})\', # YYYY. MM. DD with spaces',
+ ' r\'(\\d{4})[.,](\\d{2})[.,](\\d{2})\', # YYYY.MM.DD',
+ ' ]',
+ '',
+ ])
+
+ # Add TVA extraction method for multi-rate or non-VAT
+ if is_non_vat:
+ code_lines.extend([
+ ' def extract_tva_entries(self, text: str) -> List[dict]:',
+ ' """Non-VAT payer - returns empty list."""',
+ ' return []',
+ '',
+ ])
+ elif has_multi_rate and tva_analysis.get("dominant_format") == "lidl_multi_rate":
+ code_lines.extend([
+ ' # Store-specific TVA patterns',
+ ' TVA_PATTERNS = [',
+ ' r\'T[VU][AR]\\s+([A-D])\\s+(\\d{1,2})[.,]?\\d{0,2}\\s*%\\s+([\\d.,]+)\',',
+ ' ]',
+ '',
+ ' def extract_tva_entries(self, text: str) -> List[dict]:',
+ ' """Extract multi-rate TVA entries."""',
+ ' entries = []',
+ ' seen = set()',
+ '',
+ ' for pattern in self.TVA_PATTERNS:',
+ ' for match in re.finditer(pattern, text, re.IGNORECASE):',
+ ' try:',
+ ' code = match.group(1).upper()',
+ ' percent = int(match.group(2))',
+ ' amount = self._parse_decimal(match.group(3))',
+ '',
+ ' if amount and amount > 0:',
+ ' entry_key = (code, percent)',
+ ' if entry_key not in seen:',
+ ' entries.append({',
+ ' \'code\': code,',
+ ' \'percent\': percent,',
+ ' \'amount\': amount',
+ ' })',
+ ' seen.add(entry_key)',
+ ' except (ValueError, InvalidOperation):',
+ ' continue',
+ '',
+ ' return entries',
+ '',
+ ])
+
+ # Add validation hints method
+ code_lines.extend([
+ ' def get_validation_hints(self) -> Dict[str, Any]:',
+ f' """Return {store_name}-specific validation hints."""',
+ ' return {',
+ f' "has_multi_rate_tva": {has_multi_rate},',
+ f' "card_equals_total": True,',
+ f' "has_client_cui": {has_client_cui},',
+ f' "has_efactura": False,',
+ f' "is_non_vat_payer": {is_non_vat},',
+ ' }',
+ ])
+
+ return '\n'.join(code_lines) + '\n'
+
+
+def main():
+ parser = argparse.ArgumentParser(
+ description="Generate store profile from PDF receipts",
+ formatter_class=argparse.RawDescriptionHelpFormatter,
+ epilog="""
+Examples:
+ # Generate profile from a single PDF
+ python scripts/generate_store_profile.py \\
+ --name "Magazin Nou" --cui "12345678" \\
+ --receipts "docs/data-entry/magazin_nou.pdf"
+
+ # Generate profile from multiple PDFs (glob pattern)
+ python scripts/generate_store_profile.py \\
+ --name "Carrefour" --cui "2475489" \\
+ --receipts "docs/data-entry/Carrefour*.pdf" \\
+ --output backend/modules/data_entry/services/ocr/profiles/carrefour.py
+
+ # Dry run (analyze only, don't write file)
+ python scripts/generate_store_profile.py \\
+ --name "Test Store" --cui "11111111" \\
+ --receipts "docs/data-entry/test*.pdf" \\
+ --dry-run
+ """
+ )
+
+ parser.add_argument("--name", required=True, help="Store name (e.g., 'LIDL DISCOUNT S.R.L.')")
+ parser.add_argument("--cui", required=True, help="CUI number without RO prefix")
+ parser.add_argument("--receipts", required=True, help="PDF file path or glob pattern")
+ parser.add_argument("--output", help="Output file path (default: auto-generated)")
+ parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't write file")
+ parser.add_argument("--api-base", default=API_BASE, help=f"API base URL (default: {API_BASE})")
+
+ args = parser.parse_args()
+
+ # Update API base if provided
+ api_base = args.api_base
+
+ # Validate CUI format
+ cui = args.cui.strip().upper().replace("RO", "").replace(" ", "")
+ if not cui.isdigit() or len(cui) < 6 or len(cui) > 10:
+ print(f"Error: Invalid CUI format: {args.cui}")
+ sys.exit(1)
+
+ # Find PDF files
+ pdf_files = glob.glob(args.receipts)
+ if not pdf_files:
+ print(f"Error: No PDF files found matching: {args.receipts}")
+ sys.exit(1)
+
+ print(f"\n{'='*60}")
+ print(f"Store Profile Generator")
+ print(f"{'='*60}")
+ print(f"Store: {args.name}")
+ print(f"CUI: {cui}")
+ print(f"PDFs: {len(pdf_files)} files")
+ print(f"{'='*60}\n")
+
+ # Generate JWT token
+ token = create_jwt_token()
+
+ # Submit PDFs to OCR
+ print("Step 1: Submitting PDFs to OCR API...")
+ results = []
+ for pdf_path in pdf_files:
+ result = submit_ocr(pdf_path, token, api_base=api_base)
+ if result:
+ results.append(result)
+
+ if not results:
+ print("\nError: No successful extractions. Check if backend is running.")
+ sys.exit(1)
+
+ print(f"\nSuccessfully extracted: {len(results)}/{len(pdf_files)} PDFs")
+
+ # Analyze patterns
+ print("\nStep 2: Analyzing patterns...")
+ tva_analysis = analyze_tva_patterns(results)
+ total_analysis = analyze_total_patterns(results)
+ date_analysis = analyze_date_patterns(results)
+ payment_analysis = analyze_payment_patterns(results)
+ client_analysis = analyze_client_patterns(results)
+
+ print(f" TVA: {tva_analysis['dominant_format']} format, multi-rate={tva_analysis['has_multi_rate']}")
+ print(f" Date: {date_analysis['dominant_format']} format")
+ print(f" Payments: {list(payment_analysis['methods'].keys())}")
+ print(f" B2B: {client_analysis['has_client_cui']}")
+
+ # Generate profile code
+ print("\nStep 3: Generating profile code...")
+ code = generate_profile_code(
+ store_name=args.name,
+ cui=cui,
+ tva_analysis=tva_analysis,
+ total_analysis=total_analysis,
+ date_analysis=date_analysis,
+ payment_analysis=payment_analysis,
+ client_analysis=client_analysis
+ )
+
+ # Determine output path
+ if args.output:
+ output_path = args.output
+ else:
+ module_name = re.sub(r'[^a-z0-9]', '_', args.name.lower()).strip('_')
+ output_path = f"backend/modules/data_entry/services/ocr/profiles/{module_name}.py"
+
+ if args.dry_run:
+ print(f"\n[DRY RUN] Would write to: {output_path}")
+ print(f"\n{'='*60}")
+ print("Generated code:")
+ print(f"{'='*60}")
+ print(code)
+ else:
+ # Write file
+ os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
+ with open(output_path, 'w') as f:
+ f.write(code)
+ print(f" Written to: {output_path}")
+
+ # Verify syntax
+ import py_compile
+ try:
+ py_compile.compile(output_path, doraise=True)
+ print(f" Syntax check: OK")
+ except py_compile.PyCompileError as e:
+ print(f" Syntax check: FAILED - {e}")
+
+ print(f"\n{'='*60}")
+ print("Profile generation complete!")
+ print(f"{'='*60}")
+
+ if not args.dry_run:
+ print(f"\nNext steps:")
+ print(f"1. Review the generated code: {output_path}")
+ print(f"2. Customize patterns if needed")
+ print(f"3. Hot-reload profiles: curl -X POST {api_base}/api/data-entry/ocr/profiles/reload")
+ print(f"4. Test with a sample receipt")
+
+
+if __name__ == "__main__":
+ main()