diff --git a/.claude/rules/claude-learn-backend.md b/.claude/rules/claude-learn-backend.md new file mode 100644 index 0000000..3cd3a9f --- /dev/null +++ b/.claude/rules/claude-learn-backend.md @@ -0,0 +1,106 @@ +# Claude Learn: Backend + +**Domain**: backend +**Last updated**: 2026-01-06 +**Sessions recorded**: 1 + +Knowledge about FastAPI, Python services, Oracle DB, and backend architecture. + +--- + +## Patterns + +### ProfileRegistry with Hot-Reload for Store Profiles +**Discovered**: 2026-01-06 (feature: ocr-store-profiles) +**Description**: Registration system for OCR profiles based on the `@ProfileRegistry.register` decorator, with hot-reload via `importlib.reload()`. Profiles can be added or modified without restarting the server. + +**Example** (`backend/modules/data_entry/services/ocr/profiles/__init__.py`): +```python +class ProfileRegistry: + _profiles: Dict[str, Type["BaseStoreProfile"]] = {} + _instances: Dict[str, "BaseStoreProfile"] = {} + + @classmethod + def register(cls, profile_class): + """Decorator to register a store profile class.""" + for cui in profile_class.CUI_LIST: + cls._profiles[cls._normalize_cui(cui)] = profile_class + return profile_class + + @classmethod + def reload_all(cls): + """Hot-reload all profile modules via importlib.reload().""" + cls._instances.clear() + for module_name in cls._get_profile_module_names(): + importlib.reload(sys.modules[f"backend...profiles.{module_name}"]) +``` + +**Usage**: +```python +@ProfileRegistry.register +class LidlProfile(BaseStoreProfile): + CUI_LIST = ["22891860"] + STORE_NAME = "LIDL DISCOUNT S.R.L."
+ +# Lookup +profile = ProfileRegistry.get_profile("22891860") + +# Hot-reload via the HTTP endpoint: +# POST /api/data-entry/ocr/profiles/reload +``` + +**Tags**: registry-pattern, hot-reload, decorator, ocr, singleton + +--- + +### Script for generating Python code from PDF analysis +**Discovered**: 2026-01-06 (feature: ocr-store-profiles) +**Description**: Script that analyzes PDFs via the OCR API, detects patterns (TVA format, date format, payment), and automatically generates Python code for new profiles. Includes JWT auth, async polling, and a syntax check. + +**Example** (`scripts/generate_store_profile.py`): +```python +def analyze_tva_patterns(results: List[Dict]) -> Dict: + """Detect the dominant TVA format in the OCR results.""" + tva_formats = defaultdict(int) + for text in raw_texts: + if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper): + tva_formats["lidl_multi_rate"] += 1 + if re.search(r'BAZA\s+TVA', text_upper): + tva_formats["table"] += 1 + return {"dominant_format": max(tva_formats, key=tva_formats.get)} + +def generate_profile_code(store_name, cui, tva_analysis, ...): + """Generate Python code for the profile class.""" + # Template-based generation with OCR error variants +``` + +**Usage**: +```bash +# Dry-run for a preview +python scripts/generate_store_profile.py \ + --name "Magazin Nou" --cui "12345678" \ + --receipts "docs/data-entry/MagazinNou*.pdf" --dry-run + +# Generate and save +python scripts/generate_store_profile.py \ + --name "Magazin Nou" --cui "12345678" \ + --receipts "docs/data-entry/MagazinNou*.pdf" \ + --output backend/.../profiles/magazin_nou.py +``` + +**Tags**: code-generation, ocr, automation, cli-tool + +--- + +## Gotchas + +_(None recorded yet)_ + +--- + +## Statistics + +- **Total Patterns**: 2 +- **Total Gotchas**: 0 +- **Last Session**: 2026-01-06 +- **Sessions Recorded**: 1 diff --git a/.claude/rules/claude-learn-database.md b/.claude/rules/claude-learn-database.md new file mode 100644 index 0000000..5997042 --- /dev/null +++
b/.claude/rules/claude-learn-database.md @@ -0,0 +1,28 @@ +# Claude Learn: Database + +**Domain**: database +**Last updated**: - +**Sessions recorded**: 0 + +Knowledge about Oracle DB, SQLite, SQLModel, migrations, and data modeling. + +--- + +## Patterns + +_(None recorded yet)_ + +--- + +## Gotchas + +_(None recorded yet)_ + +--- + +## Statistics + +- **Total Patterns**: 0 +- **Total Gotchas**: 0 +- **Last Session**: - +- **Sessions Recorded**: 0 diff --git a/.claude/rules/claude-learn-deployment.md b/.claude/rules/claude-learn-deployment.md new file mode 100644 index 0000000..c7b031f --- /dev/null +++ b/.claude/rules/claude-learn-deployment.md @@ -0,0 +1,55 @@ +# Claude Learn: Deployment + +**Domain**: deployment +**Last updated**: 2026-01-06 +**Sessions recorded**: 1 + +Knowledge about IIS, Docker, deployment scripts, and infrastructure. + +--- + +## Patterns + +### IIS URL Rewrite Rules for SPA with Multiple API Backends +**Discovered**: 2025-12-22 (feature: unified-app) +**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices. + +**Example** (`public/web.config:5-28`): +```xml + + + + + + + + + + + + + + + + + + + +``` + +**Tags**: iis, deployment, spa, microservices, proxy + +--- + +## Gotchas + +_(None recorded yet)_ + +--- + +## Statistics + +- **Total Patterns**: 1 +- **Total Gotchas**: 0 +- **Last Session**: 2026-01-06 +- **Sessions Recorded**: 1 diff --git a/.claude/rules/claude-learn-domains.md b/.claude/rules/claude-learn-domains.md new file mode 100644 index 0000000..751e43f --- /dev/null +++ b/.claude/rules/claude-learn-domains.md @@ -0,0 +1,83 @@ +# Claude Learn Domains Configuration + +**Last updated**: 2026-01-06 + +This file defines available knowledge domains and their file path patterns. 
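A minimal sketch of how a changed file path might be routed to one of the domains listed in this file. The matching semantics are an assumption, not something the configuration states: `**/` is taken to span zero or more directories, `*` stays within one path segment, the first matching domain wins, and `global` is the fallback. The helper names `glob_to_regex` and `domain_for` are hypothetical.

```python
import re

# Patterns copied from the domain list in this file; "global" is the fallback.
DOMAIN_PATTERNS = {
    "frontend": ["src/**/*.vue", "src/**/*.js", "src/**/*.ts", "src/**/*.css",
                 "vite.config.*", "package.json"],
    "backend": ["backend/**/*.py", "backend/modules/**/*", "requirements.txt"],
    "database": ["**/*.sql", "**/models.py", "**/schemas.py", "backend/**/db/**/*"],
    "testing": ["tests/**/*", "**/*.test.*", "**/*.spec.*", "pytest.ini",
                "vitest.config.*"],
    "deployment": ["deployment/**/*", "public/web.config", "Dockerfile*",
                   "docker-compose*.yml", "*.sh", "ansible/**/*"],
}


def glob_to_regex(pattern):
    """Translate a glob into a compiled regex (assumed semantics, see lead-in)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole directories
            i += 3
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything inside one path segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def domain_for(path):
    """Return the first domain whose patterns match, else 'global'."""
    for domain, patterns in DOMAIN_PATTERNS.items():
        if any(glob_to_regex(p).fullmatch(path) for p in patterns):
            return domain
    return "global"
```

Because the first match wins in this sketch, a path such as `backend/modules/x/models.py` lands in `backend` rather than `database`; a real router may well prefer a different precedence.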
+ +--- + +## Domains + +### frontend +**File**: `claude-learn-frontend.md` +**Patterns**: +- `src/**/*.vue` +- `src/**/*.js` +- `src/**/*.ts` +- `src/**/*.css` +- `vite.config.*` +- `package.json` + +--- + +### backend +**File**: `claude-learn-backend.md` +**Patterns**: +- `backend/**/*.py` +- `backend/modules/**/*` +- `requirements.txt` + +--- + +### database +**File**: `claude-learn-database.md` +**Patterns**: +- `**/*.sql` +- `**/models.py` +- `**/schemas.py` +- `backend/**/db/**/*` + +--- + +### testing +**File**: `claude-learn-testing.md` +**Patterns**: +- `tests/**/*` +- `**/*.test.*` +- `**/*.spec.*` +- `pytest.ini` +- `vitest.config.*` + +--- + +### deployment +**File**: `claude-learn-deployment.md` +**Patterns**: +- `deployment/**/*` +- `public/web.config` +- `Dockerfile*` +- `docker-compose*.yml` +- `*.sh` +- `ansible/**/*` + +--- + +### global +**File**: `claude-learn-global.md` +**Patterns**: +- `*` (catch-all for cross-cutting concerns) + +--- + +## Statistics + +| Domain | Patterns | Gotchas | Last Updated | +|--------|----------|---------|--------------| +| frontend | 8 | 10 | 2026-01-06 | +| deployment | 1 | 0 | 2026-01-06 | +| global | 0 | 1 | 2026-01-06 | +| backend | 2 | 0 | 2026-01-06 | +| database | 0 | 0 | - | +| testing | 0 | 0 | - | + +**Total**: 11 patterns, 11 gotchas across 4 domains diff --git a/.claude/rules/auto-build-memory.md b/.claude/rules/claude-learn-frontend.md similarity index 84% rename from .claude/rules/auto-build-memory.md rename to .claude/rules/claude-learn-frontend.md index d645842..bc43694 100644 --- a/.claude/rules/auto-build-memory.md +++ b/.claude/rules/claude-learn-frontend.md @@ -1,9 +1,10 @@ -# Learned Patterns & Gotchas +# Claude Learn: Frontend -**Last updated**: 2025-12-24 -**Maintained**: Manually (add new patterns/gotchas as discovered) +**Domain**: frontend +**Last updated**: 2026-01-06 +**Sessions recorded**: 3 -This file contains insights learned during feature implementations. 
Claude Code auto-loads this file to prevent repeating past mistakes. +Knowledge about Vue.js, Vite, Pinia, CSS, and frontend architecture. --- @@ -130,37 +131,6 @@ resolve: { --- -### IIS URL Rewrite Rules for SPA with Multiple API Backends -**Discovered**: 2025-12-22 (feature: unified-app) -**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices. - -**Example** (`public/web.config:5-28`): -```xml - - - - - - - - - - - - - - - - - - - -``` - -**Tags**: iis, deployment, spa, microservices, proxy - ---- - ### Vue Watcher for Auto-Loading Dependent Data **Discovered**: 2025-12-24 (feature: unified-app-ux) **Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention. @@ -248,15 +218,6 @@ const getStorageKey = () => { --- -### Sed Command Quote Mismatch in Bulk Find-Replace -**Discovered**: 2025-12-22 (feature: unified-app) -**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines. -**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles. - -**Tags**: sed, regex, scripting, find-replace, migration - ---- - ### Circular Reference in API Wrapper **Discovered**: 2025-12-22 (feature: unified-app) **Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name. 
@@ -287,7 +248,7 @@ const getStorageKey = () => { ### Vite Build Transform Count is Progress Indicator **Discovered**: 2025-12-22 (feature: unified-app) **Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration. -**Solution**: Watch the 'transforming... ✓ N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200→573→1490→1492 modules meant we were getting close to success. Use this as encouragement! +**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200->573->1490->1492 modules meant we were getting close to success. Use this as encouragement! **Tags**: vite, build, debugging, progress-tracking, developer-experience @@ -329,9 +290,9 @@ const getStorageKey = () => { --- -## Memory Statistics +## Statistics -- **Total Patterns**: 9 -- **Total Gotchas**: 11 -- **Last Session**: 2025-12-24 (unified-app-ux) -- **Sessions Recorded**: 2 +- **Total Patterns**: 8 +- **Total Gotchas**: 10 +- **Last Session**: 2026-01-06 +- **Sessions Recorded**: 3 diff --git a/.claude/rules/claude-learn-global.md b/.claude/rules/claude-learn-global.md new file mode 100644 index 0000000..3a27a79 --- /dev/null +++ b/.claude/rules/claude-learn-global.md @@ -0,0 +1,33 @@ +# Claude Learn: Global + +**Domain**: global +**Last updated**: 2026-01-06 +**Sessions recorded**: 1 + +Cross-cutting knowledge applicable to multiple domains (scripting, tooling, workflow). + +--- + +## Patterns + +_(None recorded yet)_ + +--- + +## Gotchas + +### Sed Command Quote Mismatch in Bulk Find-Replace +**Discovered**: 2025-12-22 (feature: unified-app) +**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines. 
+**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles. + +**Tags**: sed, regex, scripting, find-replace, migration + +--- + +## Statistics + +- **Total Patterns**: 0 +- **Total Gotchas**: 1 +- **Last Session**: 2026-01-06 +- **Sessions Recorded**: 1 diff --git a/.claude/rules/claude-learn-testing.md b/.claude/rules/claude-learn-testing.md new file mode 100644 index 0000000..a1527bf --- /dev/null +++ b/.claude/rules/claude-learn-testing.md @@ -0,0 +1,28 @@ +# Claude Learn: Testing + +**Domain**: testing +**Last updated**: - +**Sessions recorded**: 0 + +Knowledge about pytest, Vitest, test patterns, and validation strategies. + +--- + +## Patterns + +_(None recorded yet)_ + +--- + +## Gotchas + +_(None recorded yet)_ + +--- + +## Statistics + +- **Total Patterns**: 0 +- **Total Gotchas**: 0 +- **Last Session**: - +- **Sessions Recorded**: 0 diff --git a/backend/modules/data_entry/routers/ocr.py b/backend/modules/data_entry/routers/ocr.py index c1846f1..4965182 100644 --- a/backend/modules/data_entry/routers/ocr.py +++ b/backend/modules/data_entry/routers/ocr.py @@ -628,3 +628,86 @@ def _dict_to_extraction_data(data: dict) -> ExtractionData: validation_errors=data.get('validation_errors', []), inter_ocr_ratios=data.get('inter_ocr_ratios', {}), ) + + +# ============================================================================ +# Store Profiles Management Endpoints +# ============================================================================ + +@router.post("/profiles/reload") +async def reload_store_profiles( + current_user: CurrentUser = Depends(get_current_user) +) -> dict: + """ + Hot-reload all store profiles. + + Reloads profile Python modules without server restart. + Use after adding/modifying profile files. 
+ + Returns: + Dict with reloaded count and profile list + """ + from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry + + count = ProfileRegistry.reload_all() + status = ProfileRegistry.get_reload_status() + + return { + "success": True, + "reloaded_modules": count, + "profiles_count": status["profiles_count"], + "registered_cuis": status["registered_cuis"], + "last_reload": status["last_reload"], + } + + +@router.get("/profiles") +async def list_store_profiles( + current_user: CurrentUser = Depends(get_current_user) +) -> dict: + """ + List all registered store profiles. + + Returns: + Dict with profiles list and status + """ + from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry + + profiles = ProfileRegistry.list_profiles() + status = ProfileRegistry.get_reload_status() + + return { + "profiles": profiles, + "count": len(profiles), + "last_reload": status["last_reload"], + } + + +@router.get("/profiles/{cui}") +async def get_store_profile( + cui: str, + current_user: CurrentUser = Depends(get_current_user) +) -> dict: + """ + Get details for a specific store profile. + + Args: + cui: Store CUI (with or without RO prefix) + + Returns: + Profile details including validation hints + + Raises: + 404: If no profile exists for this CUI + """ + from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry + + info = ProfileRegistry.get_profile_info(cui) + + if not info: + raise HTTPException( + status_code=404, + detail=f"No profile registered for CUI: {cui}" + ) + + return info diff --git a/backend/modules/data_entry/services/ocr/profiles/README.md b/backend/modules/data_entry/services/ocr/profiles/README.md new file mode 100644 index 0000000..65c7695 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/README.md @@ -0,0 +1,258 @@ +# Store Profiles - OCR Extraction + +System of store-specific profiles for OCR extraction, with hot-reload.
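A self-contained sketch of why hot-reload works with a decorator-based registry: `importlib.reload()` re-executes the module body, so the decorator runs again and overwrites the old class. This is an illustration under stated assumptions, not the project's `ProfileRegistry`; the names `registry`, `demo_registry`, `demo_profile`, and `DemoProfile` are hypothetical stand-ins, and the profile file is written to a temporary directory.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Hypothetical stand-in for the registry's CUI -> profile class mapping.
registry = {}


def register(profile_class):
    """Class decorator: index the class under every CUI it declares."""
    for cui in profile_class.CUI_LIST:
        registry[cui] = profile_class  # re-registration overwrites the old class
    return profile_class


# Expose the decorator under an importable (hypothetical) module name so the
# generated profile file below can do `from demo_registry import register`.
_host = types.ModuleType("demo_registry")
_host.register = register
sys.modules["demo_registry"] = _host

PROFILE_SOURCE = '''
from demo_registry import register

@register
class DemoProfile:
    CUI_LIST = ["12345678"]
    STORE_NAME = "Demo Store v1"
'''


def run_demo():
    """Import a profile module, edit it on disk, reload it, return both names."""
    old_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid a stale .pyc after the quick edit
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo_profile.py"
        path.write_text(PROFILE_SOURCE)
        sys.path.insert(0, tmp)
        try:
            module = importlib.import_module("demo_profile")
            before = registry["12345678"].STORE_NAME
            # Simulate editing the profile file while the process keeps running.
            path.write_text(PROFILE_SOURCE.replace("v1", "v2 (edited)"))
            importlib.invalidate_caches()
            importlib.reload(module)  # re-runs the module body, so @register fires again
            after = registry["12345678"].STORE_NAME
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_profile", None)
            sys.dont_write_bytecode = old_flag
    return before, after


if __name__ == "__main__":
    print(run_demo())
```

One limitation this sketch shares with any reload-by-re-registration approach: a class whose file has been deleted is never re-executed, so its registry entry stays in place until it is removed explicitly.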
+ +--- + +## Quick Start: Add a new profile + +```bash +# 1. Generate a profile from PDFs (dry-run for a preview) +python scripts/generate_store_profile.py \ + --name "Magazin Nou SRL" \ + --cui "12345678" \ + --receipts "docs/data-entry/MagazinNou*.pdf" \ + --dry-run + +# 2. Generate and save +python scripts/generate_store_profile.py \ + --name "Magazin Nou SRL" \ + --cui "12345678" \ + --receipts "docs/data-entry/MagazinNou*.pdf" \ + --output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py + +# 3. Hot-reload (no server restart; the endpoint requires an Authorization header) +curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload + +# 4. Verify +curl http://localhost:8000/api/data-entry/ocr/profiles +``` + +--- + +## Directory structure + +``` +profiles/ +├── __init__.py # ProfileRegistry + hot-reload (~390 lines) +├── base.py # BaseStoreProfile + generic patterns (~515 lines) +├── lidl.py # Multi-rate TVA (A/B) +├── omv.py # B2B, date YYYY.MM.DD +├── socar.py # B2B, date YYYY.MM.DD +├── brick.py # Standard TVA +├── dedeman.py # E-factura support +├── kineterra.py # Non-VAT payer +├── gama_ink.py # Standard TVA (toner/cartridges) +├── electrobering.py # Standard TVA (electronics) +├── pictus_velum.py # Standard TVA (stationery) +├── unlimited_keys.py # Standard TVA, NUMERAR payment +├── best_print.py # Non-VAT payer (not registered for TVA) +├── stepout_market.py # TVA 5% (books/bookstore) +└── README.md # This file +``` + +--- + +## Existing profiles (12 profiles) + +> **Note**: The TVA patterns are **flexible** and accept ANY rate (5%, 9%, 11%, 19%, 21%, etc.) +> so that they handle both historical data and future changes in legislation. + +| Store | CUI | File | Characteristics | +|---------|-----|--------|----------------| +| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (codes A, B, C, D) | +| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), date YYYY.MM.DD | +| SOCAR PETROLEUM S.A.
| 12546600 | `socar.py` | B2B (client CUI), date YYYY.MM.DD | +| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA | +| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support | +| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returns `[]`) | +| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartridges) | +| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronics) | +| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (stationery) | +| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** (cash) payment | +| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** (not registered for TVA) | +| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (books, bookstore) | + +--- + +## API Endpoints + +| Endpoint | Method | Description | +|----------|--------|-----------| +| `/api/data-entry/ocr/profiles` | GET | List all profiles | +| `/api/data-entry/ocr/profiles/{cui}` | GET | Profile details (accepts RO prefix) | +| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload all profiles | + +### API examples + +```bash +# List profiles +curl http://localhost:8000/api/data-entry/ocr/profiles \ + -H "Authorization: Bearer " + +# Profile details (with or without RO prefix) +curl http://localhost:8000/api/data-entry/ocr/profiles/22891860 +curl http://localhost:8000/api/data-entry/ocr/profiles/RO22891860 + +# Hot-reload after changes +curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \ + -H "Authorization: Bearer " + +# Reload response: +{ + "success": true, + "reloaded_modules": 12, + "profiles_count": 12, + "registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...], + "last_reload": "2026-01-06T22:37:05.000000" +} +``` + +--- + +## How the system works + +### Extraction flow + +``` +ReceiptExtractor.extract() + │ + ├─► STEP 1: Extract vendor + CUI + │ └─► _extract_vendor(), _extract_cui() + │ + ├─►
ProfileRegistry.get_profile(cui) + │ └─► Returns the store-specific profile or None + │ + ├─► STEP 2: Extraction with the profile (if one exists) + │ ├─► profile.extract_total() + │ ├─► profile.extract_date() + │ ├─► profile.extract_receipt_number() + │ ├─► profile.extract_tva_entries() + │ ├─► profile.extract_payment_methods() + │ └─► profile.extract_client_cui() + │ + └─► STEP 3-4: Validation + post-processing +``` + +### Fallback + +If no profile exists for the CUI, the generic logic in `ReceiptExtractor` is used. + +--- + +## Structure of a profile + +```python +from typing import Any, Dict, List + +from .base import BaseStoreProfile +from . import ProfileRegistry + +@ProfileRegistry.register +class MagazinNouProfile(BaseStoreProfile): + """Docstring describing the store.""" + + CUI_LIST = ["12345678"] # May list several CUIs + NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"] # OCR variants + STORE_NAME = "Magazin Nou SRL" + + # Override only what differs from the base class + def extract_tva_entries(self, text: str) -> List[dict]: + # Store-specific patterns + ... + + def get_validation_hints(self) -> Dict[str, Any]: + return { + "has_multi_rate_tva": False, + "card_equals_total": True, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": False, + } +``` + +--- + +## Patterns available in base.py + +BaseStoreProfile includes generic, OCR-tolerant patterns: + +| Pattern | Description | +|---------|-----------| +| `TOTAL_PATTERNS` | 8 variants for TOTAL (TOTAL:, TOTAL DE PLATA, etc.) | +| `DATE_PATTERNS` | 6 variants (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) | +| `DATE_PATTERNS_OCR_SPACES` | 4 variants with OCR spaces ("2025. 08.
14") | +| `NUMBER_PATTERNS` | 11 variants for the receipt number (NDS, BF, C3POS) | +| `PAYMENT_PATTERNS` | 8 variants for CARD/NUMERAR | +| `CLIENT_MARKERS` | 6 variants for the CLIENT section | +| `CLIENT_CUI_PATTERNS` | 7 variants for the client CUI | + +### Methods implemented in the base class + +- `extract_total(text)` → `Tuple[Decimal, float]` +- `extract_date(text)` → `Tuple[date, float]` +- `extract_receipt_number(text)` → `Tuple[str, float]` +- `extract_payment_methods(text)` → `List[dict]` +- `extract_client_cui(text)` → `Tuple[str, float]` +- `extract_client_name(text)` → `Tuple[str, float]` + +--- + +## When do you need a custom profile? + +| Situation | Example | What to override | +|----------|---------|---------------------| +| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` | +| **Special date format** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` | +| **B2B receipts** | Gas stations (they carry a client CUI) | `extract_client_cui()` | +| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returns `[]` | +| **E-factura** | Dedeman | `extract_efactura_reference()` | + +--- + +## Design decisions + +1. **Manual hot-reload** - the `/profiles/reload` endpoint is called whenever profile files change +2. **Persistence in Python** - profiles live in Git, version controlled +3. **Graceful fallback** - if no profile exists, the generic logic is used +4. **CUI normalization** - automatically handles the "RO" prefix and whitespace +5.
**TVA deduplication** - uses `seen = set()` to avoid duplicates + +--- + +## Useful commands + +```bash +# Check Python syntax for all profiles +for f in backend/modules/data_entry/services/ocr/profiles/*.py; do + python3 -m py_compile "$f" && echo "✓ $(basename "$f")" +done + +# List profile files +ls -la backend/modules/data_entry/services/ocr/profiles/ + +# Start the backend for testing +cd backend && source venv/bin/activate +uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1 + +# Test OCR on a PDF +curl -X POST -F "file=@docs/data-entry/test.pdf" \ + -H "Authorization: Bearer " \ + "http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus" +``` + +--- + +## Profile generation script + +`scripts/generate_store_profile.py` - automatic profile generator + +```bash +# Show help +python scripts/generate_store_profile.py --help + +# Features: +# - Analyzes PDFs via the OCR API +# - Detects: TVA format, date format, payment patterns, B2B +# - Generates Python code with OCR error variants +# - Supports glob patterns (*.pdf) +# - Checks syntax after generation +``` diff --git a/backend/modules/data_entry/services/ocr/profiles/__init__.py b/backend/modules/data_entry/services/ocr/profiles/__init__.py new file mode 100644 index 0000000..cb0be54 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/__init__.py @@ -0,0 +1,388 @@ +""" +Store Profiles Registry with Hot-Reload Support. + +This module provides a registry for store-specific OCR extraction profiles. +Profiles can be reloaded at runtime without restarting the server.
+ +Usage: + from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry + + # Get profile for a CUI + profile = ProfileRegistry.get_profile("22891860") + if profile: + tva_entries = profile.extract_tva_entries(text) + + # Reload all profiles (after file changes) + count = ProfileRegistry.reload_all() + +Architecture: + - ProfileRegistry: Singleton registry with class methods + - BaseStoreProfile: Abstract base class for profiles + - @ProfileRegistry.register: Decorator for profile classes + +Hot-Reload Mechanism: + 1. Admin calls POST /profiles/reload endpoint + 2. Registry clears instance cache + 3. importlib.reload() re-executes each profile module + 4. @register decorator re-registers classes with new code +""" + +from __future__ import annotations + +import importlib +import logging +import sys +from datetime import datetime +from pathlib import Path +from typing import Dict, List, Optional, Type, TYPE_CHECKING + +if TYPE_CHECKING: + from .base import BaseStoreProfile + +logger = logging.getLogger(__name__) + +# Directory containing profile modules +PROFILES_DIR = Path(__file__).parent + + +class ProfileRegistry: + """ + Registry for store-specific OCR extraction profiles. + + Uses class methods for singleton-like behavior without explicit instantiation. + Supports hot-reload via importlib.reload() for runtime updates. 
+ + Attributes: + _profiles: Maps CUI -> profile class (not instance) + _instances: Maps CUI -> profile instance (lazy, cleared on reload) + _last_reload: Timestamp of last reload + _loaded: Whether initial load has been performed + """ + + # Class-level storage (singleton pattern via class methods) + _profiles: Dict[str, Type["BaseStoreProfile"]] = {} + _instances: Dict[str, "BaseStoreProfile"] = {} + _last_reload: Optional[datetime] = None + _loaded: bool = False + + # ------------------------------------------------------------------------- + # Registration + # ------------------------------------------------------------------------- + + @classmethod + def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]: + """ + Decorator to register a store profile class. + + Registers the profile for all CUIs in the class's CUI_LIST. + Safe for re-registration during hot-reload (overwrites existing). + + Usage: + @ProfileRegistry.register + class LidlProfile(BaseStoreProfile): + CUI_LIST = ["22891860"] + ... 
+ + Args: + profile_class: Profile class to register + + Returns: + The same class (allows use as decorator) + + Note: + A class with an empty CUI_LIST is skipped with a warning (no exception is raised) + """ + cui_list = getattr(profile_class, 'CUI_LIST', []) + store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__) + + if not cui_list: + logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping") + return profile_class + + # Register for each CUI + for cui in cui_list: + # Normalize CUI (remove RO prefix, strip whitespace) + normalized_cui = cls._normalize_cui(cui) + + if normalized_cui in cls._profiles: + old_class = cls._profiles[normalized_cui] + logger.debug( + f"Re-registering CUI {normalized_cui}: " + f"{old_class.__name__} -> {profile_class.__name__}" + ) + # Clear cached instance for this CUI + cls._instances.pop(normalized_cui, None) + + cls._profiles[normalized_cui] = profile_class + logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}") + + logger.info(f"Registered {store_name} for CUIs: {cui_list}") + return profile_class + + # ------------------------------------------------------------------------- + # Lookup + # ------------------------------------------------------------------------- + + @classmethod + def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]: + """ + Get profile instance for a CUI. + + Uses lazy instantiation - creates instance on first access. + Returns None if no profile is registered for this CUI.
+ + Args: + cui: CUI to lookup (with or without RO prefix) + + Returns: + Profile instance or None + """ + if not cui: + return None + + # Ensure profiles are loaded + if not cls._loaded: + cls._load_all_profiles() + + normalized_cui = cls._normalize_cui(cui) + + # Check if profile exists + profile_class = cls._profiles.get(normalized_cui) + if not profile_class: + return None + + # Lazy instantiation + if normalized_cui not in cls._instances: + try: + cls._instances[normalized_cui] = profile_class() + logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}") + except Exception as e: + logger.error(f"Failed to instantiate {profile_class.__name__}: {e}") + return None + + return cls._instances[normalized_cui] + + @classmethod + def has_profile(cls, cui: Optional[str]) -> bool: + """Check if a profile exists for this CUI.""" + if not cui: + return False + if not cls._loaded: + cls._load_all_profiles() + return cls._normalize_cui(cui) in cls._profiles + + # ------------------------------------------------------------------------- + # Listing + # ------------------------------------------------------------------------- + + @classmethod + def list_profiles(cls) -> List[Dict]: + """ + List all registered profiles. 
+ + Returns: + List of dicts with cuis, class_name, store_name, name_patterns + """ + if not cls._loaded: + cls._load_all_profiles() + + result = [] + seen_classes = set() + + for cui, profile_class in cls._profiles.items(): + # Avoid duplicates for profiles with multiple CUIs + if profile_class.__name__ in seen_classes: + continue + seen_classes.add(profile_class.__name__) + + result.append({ + "cuis": list(getattr(profile_class, 'CUI_LIST', [])), + "class_name": profile_class.__name__, + "store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__), + "name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])), + }) + + return result + + @classmethod + def get_profile_info(cls, cui: str) -> Optional[Dict]: + """ + Get detailed info about a profile. + + Args: + cui: CUI to lookup + + Returns: + Dict with profile details or None + """ + profile = cls.get_profile(cui) + if not profile: + return None + + return { + "cui": cui, + "cuis": list(profile.CUI_LIST), + "class_name": profile.__class__.__name__, + "store_name": profile.STORE_NAME, + "name_patterns": list(profile.NAME_PATTERNS), + "validation_hints": profile.get_validation_hints(), + } + + # ------------------------------------------------------------------------- + # Hot-Reload + # ------------------------------------------------------------------------- + + @classmethod + def reload_all(cls) -> int: + """ + Hot-reload all profile modules. + + Clears instance cache and reloads all .py files in profiles directory. + Decorator re-registers classes with updated code.
+ + Returns: + Number of modules reloaded + """ + logger.info("Starting profile hot-reload...") + + # Clear instance cache (will be recreated on next get_profile) + cls._instances.clear() + + # Get list of profile modules (exclude __init__, base) + module_names = cls._get_profile_module_names() + + count = 0 + for module_name in module_names: + full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}" + + try: + if full_name in sys.modules: + # Reload existing module + importlib.reload(sys.modules[full_name]) + logger.debug(f"Reloaded module: {module_name}") + else: + # Import new module + importlib.import_module(full_name) + logger.debug(f"Imported new module: {module_name}") + count += 1 + except Exception as e: + logger.error(f"Failed to reload {module_name}: {e}") + + cls._last_reload = datetime.utcnow() + cls._loaded = True + + logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles") + return count + + @classmethod + def get_reload_status(cls) -> Dict: + """Get status of the registry including last reload time.""" + return { + "loaded": cls._loaded, + "last_reload": cls._last_reload.isoformat() if cls._last_reload else None, + "profiles_count": len(cls._profiles), + "instances_count": len(cls._instances), + "registered_cuis": list(cls._profiles.keys()), + } + + # ------------------------------------------------------------------------- + # Internal methods + # ------------------------------------------------------------------------- + + @classmethod + def _normalize_cui(cls, cui: str) -> str: + """ + Normalize CUI for consistent lookup. 
+ + - Removes RO prefix (with or without space) + - Strips whitespace + - Converts to uppercase + + Args: + cui: Raw CUI string + + Returns: + Normalized CUI (uppercase, without the RO prefix; other characters are kept as-is) + """ + if not cui: + return "" + + cui = str(cui).strip().upper() + + # Remove RO prefix (handles "RO12345" and "RO 12345") + if cui.startswith("RO"): + cui = cui[2:].lstrip() + + return cui.strip() + + @classmethod + def _get_profile_module_names(cls) -> List[str]: + """ + Get list of profile module names from profiles directory. + + Excludes __init__.py and base.py. + + Returns: + List of module names (without .py extension) + """ + excluded = {"__init__", "base", "__pycache__"} + modules = [] + + for path in PROFILES_DIR.glob("*.py"): + name = path.stem + if name not in excluded: + modules.append(name) + + return sorted(modules) + + @classmethod + def _load_all_profiles(cls) -> None: + """ + Initial load of all profile modules. + + Called automatically on first get_profile() if not already loaded. + """ + if cls._loaded: + return + + logger.info("Loading store profiles...") + + module_names = cls._get_profile_module_names() + + for module_name in module_names: + full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}" + try: + importlib.import_module(full_name) + logger.debug(f"Loaded module: {module_name}") + except Exception as e: + logger.error(f"Failed to load {module_name}: {e}") + + cls._loaded = True + cls._last_reload = datetime.utcnow() + + logger.info(f"Loaded {len(cls._profiles)} store profiles") + + @classmethod + def clear(cls) -> None: + """ + Clear all registered profiles. + + Mainly useful for testing.
+ """ + cls._profiles.clear() + cls._instances.clear() + cls._loaded = False + cls._last_reload = None + + +# ------------------------------------------------------------------------- +# Module exports +# ------------------------------------------------------------------------- + +__all__ = [ + "ProfileRegistry", + "BaseStoreProfile", +] + +# Re-export BaseStoreProfile for convenience +from .base import BaseStoreProfile diff --git a/backend/modules/data_entry/services/ocr/profiles/base.py b/backend/modules/data_entry/services/ocr/profiles/base.py new file mode 100644 index 0000000..1dd718e --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/base.py @@ -0,0 +1,515 @@ +""" +Base class for store-specific OCR extraction profiles. + +Each store can have different receipt formats (TVA layout, total position, etc.). +Store profiles allow customizing extraction logic per-store for better accuracy. + +Usage: + from .base import BaseStoreProfile + from . import ProfileRegistry + + @ProfileRegistry.register + class LidlProfile(BaseStoreProfile): + CUI_LIST = ["22891860"] + NAME_PATTERNS = ["LIDL", "LDL"] + + def extract_tva_entries(self, text: str) -> List[dict]: + # Custom Lidl TVA extraction logic + ... +""" + +import re +from abc import ABC +from decimal import Decimal, InvalidOperation +from typing import List, Optional, Tuple, Dict, Any +from datetime import date + + +class BaseStoreProfile(ABC): + """ + Abstract base class for store-specific extraction profiles. + + Each profile defines: + - CUI_LIST: CUI codes that identify this store (without RO prefix) + - NAME_PATTERNS: OCR-tolerant name patterns for fallback matching + - Custom extraction methods for TVA, total, date, etc. + + The ProfileRegistry uses CUI_LIST to lookup profiles during extraction. 
+ """ + + # ------------------------------------------------------------------------- + # Class attributes - override in subclasses + # ------------------------------------------------------------------------- + + # List of CUI codes (without RO prefix) that identify this store + CUI_LIST: List[str] = [] + + # OCR-tolerant name patterns for fallback matching + NAME_PATTERNS: List[str] = [] + + # Store display name + STORE_NAME: str = "Unknown Store" + + # ------------------------------------------------------------------------- + # Generic patterns - can be overridden in subclasses + # ------------------------------------------------------------------------- + + # Total amount patterns (confidence-weighted) + TOTAL_PATTERNS = [ + (r'T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98), + (r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98), + (r'[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95), + (r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95), + (r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95), + (r'SUBTOTAL\s*([\d\s.,]+)', 0.90), + (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90), + (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85), + ] + + # Date patterns (confidence-weighted) + DATE_PATTERNS = [ + (r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98), + (r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98), + (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95), + (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90), + (r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80), + (r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75), + ] + + # Date patterns with OCR-introduced spaces (separate because format is different) + DATE_PATTERNS_OCR_SPACES = [ + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'), + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'), + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'), + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'), + ] + + # Receipt number patterns (confidence-weighted) + NUMBER_PATTERNS = [ + (r'NDS\s*:?\s*(\d+)', 0.98), + 
(r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98), + (r'C3POS.*?(\d{6,7})\b', 0.95), + (r'BF\s*:\s*(\d{4,})', 0.96), + (r'BF\s+(\d{4,})', 0.93), + (r'NIVS\s*:?\s*(\d+)', 0.95), + (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95), + (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95), + (r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95), + (r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90), + (r'ID\s*BF\s*:?\s*(\d+)', 0.90), + ] + + # Payment method patterns (pattern, method_type, confidence) + PAYMENT_PATTERNS = [ + (r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98), + (r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97), + (r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95), + (r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95), + (r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90), + (r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70), + (r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75), + (r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70), + ] + + # Client section markers (for B2B receipts) + CLIENT_MARKERS = [ + r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:', + r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:', + r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:', + r'CLIENT\s*:', + r'CUMPARATOR\s*:', + r'BENEFICIAR\s*:', + ] + + # Client CUI patterns (pattern, confidence) + CLIENT_CUI_PATTERNS = [ + (r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99), + (r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98), + (r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98), + (r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98), + (r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98), + (r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95), + (r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90), + ] + + # Company type indicators (for identifying company names) + COMPANY_INDICATORS = [ + r'\bS\.?\s*R\.?\s*L\.?\b', # S.R.L. or S. R. L. + r'\bS\.?\s*A\.?\b', # S.A. or S. A. + r'\bS\.?\s*N\.?\s*C\.?\b', # S.N.C. or S. N. C. + r'\bS\.?\s*C\.?\s*S\.?\b', # S.C.S. or S. 
C. S. + r'\bI\.?\s*I\.?\b', # I.I. or I. I. + r'\bP\.?\s*F\.?\s*A\.?\b', # P.F.A. or P. F. A. + r'\bS\.?\s*C\.?\s+[A-Z]', # S.C. followed by company name + r'HOLDING', + r'COMPANY', + r'GROUP', + ] + + # Maximum reasonable payment amount (to filter OCR errors) + MAX_PAYMENT = Decimal('100000') + + # ------------------------------------------------------------------------- + # Extraction methods - override in subclasses as needed + # ------------------------------------------------------------------------- + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. + + Override this method in subclasses to handle store-specific TVA formats. + + Args: + text: Raw OCR text from receipt + + Returns: + List of dicts with keys: code, percent, amount + """ + return [] + + def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]: + """ + Extract total amount from receipt text. + + Args: + text: Raw OCR text from receipt + + Returns: + Tuple of (amount, confidence) or (None, 0.0) + """ + text_upper = text.upper() + + for pattern, confidence in self.TOTAL_PATTERNS: + match = re.search(pattern, text_upper) + if match: + amount = self._parse_decimal(match.group(1)) + if amount and amount > 0 and amount < self.MAX_PAYMENT: + return (amount, confidence) + + return (None, 0.0) + + def extract_date(self, text: str) -> Tuple[Optional[date], float]: + """ + Extract receipt date from text. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + Tuple of (date, confidence) or (None, 0.0) + """ + text_upper = text.upper() + + # Try standard patterns first + for pattern, confidence in self.DATE_PATTERNS: + match = re.search(pattern, text_upper) + if match: + parsed = self._parse_date(match.group(1)) + if parsed: + return (parsed, confidence) + + # Try OCR-corrupted patterns with spaces + for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES: + match = re.search(pattern, text_upper) + if match: + try: + if fmt == 'ymd': + year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3)) + else: # dmy + day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3)) + + if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100: + return (date(year, month, day), confidence) + except (ValueError, TypeError): + continue + + return (None, 0.0) + + def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]: + """ + Extract receipt number from text. + + Args: + text: Raw OCR text from receipt + + Returns: + Tuple of (number, confidence) or (None, 0.0) + """ + text_upper = text.upper() + + for pattern, confidence in self.NUMBER_PATTERNS: + match = re.search(pattern, text_upper) + if match: + number = match.group(1).strip() + if number and len(number) >= 3: + return (number, confidence) + + return (None, 0.0) + + def extract_payment_methods(self, text: str) -> List[dict]: + """ + Extract payment methods (CARD/NUMERAR) from receipt. + + Supports multiple payments of the same type (e.g., 2x CARD for split payments). + Each payment is returned as a separate entry with its amount. + + Args: + text: Raw OCR text from receipt + + Returns: + List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}] + Multiple entries of same method type are allowed for split payments. 
+ """ + text_upper = text.upper() + methods = [] + # Track (method, amount) pairs to avoid exact duplicates from overlapping patterns + seen_entries = set() + + for pattern, method, confidence in self.PAYMENT_PATTERNS: + for match in re.finditer(pattern, text_upper): + try: + amount = self._parse_decimal(match.group(1)) + if amount and amount > 0 and amount < self.MAX_PAYMENT: + # Deduplicate by (method, amount) to avoid same entry from multiple patterns + # But allow different amounts for same method (split payments) + entry_key = (method, amount) + if entry_key not in seen_entries: + methods.append({ + 'method': method, + 'amount': amount, + 'confidence': confidence + }) + seen_entries.add(entry_key) + except (ValueError, InvalidOperation): + continue + + return methods + + def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]: + """ + Extract client CUI from B2B receipts. + + Args: + text: Raw OCR text from receipt + + Returns: + Tuple of (cui, confidence) or (None, 0.0) + """ + text_upper = text.upper() + + # First check if there's a CLIENT section + has_client_section = any( + re.search(marker, text_upper, re.IGNORECASE) + for marker in self.CLIENT_MARKERS + ) + + if not has_client_section: + return (None, 0.0) + + # Try to extract CUI + for pattern, confidence in self.CLIENT_CUI_PATTERNS: + match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE) + if match: + cui = match.group(1) + # Normalize: strip the RO prefix first (OCR may read it as "R0"), + # so the misread "0" is not kept as a leading CUI digit + cui_digits = re.sub(r'[^0-9]', '', re.sub(r'^R[O0]?', '', cui)) + if 6 <= len(cui_digits) <= 10: + return (cui_digits, confidence) + + return (None, 0.0) + + def extract_client_name(self, text: str) -> Tuple[Optional[str], float]: + """ + Extract client/buyer company name from B2B receipts.
+ + Args: + text: Raw OCR text from receipt + + Returns: + Tuple of (client_name, confidence) or (None, 0.0) + """ + text_upper = text.upper() + lines = text.split('\n') + + # First check if there's a CLIENT section + client_section_idx = None + for i, line in enumerate(lines): + line_upper = line.upper().strip() + if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS): + client_section_idx = i + break + + if client_section_idx is None: + return (None, 0.0) + + # Look for company name in CLIENT section + line = lines[client_section_idx].strip() + line_upper = line.upper() + + # Strategy 1: Check if name is on same line after ":" + if ':' in line: + name_part = line.split(':', 1)[1].strip() + if name_part and len(name_part) >= 3: + # Skip if it looks like a CUI (RO followed by digits) + if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()): + pass # This is CUI, not name - continue to next strategy + else: + # Check for company indicators + name_upper = name_part.upper() + if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS): + return (self._clean_company_name(name_part), 0.95) + elif len(name_part) >= 5 and not name_part.isdigit(): + return (self._clean_company_name(name_part), 0.80) + + # Strategy 2: Check next line for company name + if client_section_idx + 1 < len(lines): + next_line = lines[client_section_idx + 1].strip() + next_upper = next_line.upper() + + # Skip if it's a CUI/CIF line or looks like CUI + if not re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', next_upper): + if not re.match(r'^R[O0]?\d{6,10}$', next_upper): + if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS): + return (self._clean_company_name(next_line), 0.90) + elif len(next_line) >= 5 and not next_line.isdigit(): + # Check it's not CUI/CIF/COD keywords + if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']): + return (self._clean_company_name(next_line), 0.75) + + # Strategy 3: Look for any line with company 
indicators in CLIENT section region + search_end = min(client_section_idx + 5, len(lines)) + for i in range(client_section_idx + 1, search_end): + line = lines[i].strip() + line_upper = line.upper() + + # Skip CUI/CIF lines + if re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', line_upper): + continue + if re.match(r'^R[O0]?\d{6,10}$', line_upper): + continue + + if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS): + return (self._clean_company_name(line), 0.85) + + return (None, 0.0) + + @staticmethod + def _clean_company_name(name: str) -> str: + """Clean company name for storage.""" + if not name: + return "" + # Remove extra whitespace + name = re.sub(r'\s+', ' ', name).strip() + # Remove trailing punctuation except periods in S.R.L., S.A., etc. + name = re.sub(r'[,;:]+$', '', name).strip() + return name + + # ------------------------------------------------------------------------- + # Validation hints - override to customize validation behavior + # ------------------------------------------------------------------------- + + def get_validation_hints(self) -> Dict[str, Any]: + """ + Return validation hints for this store. + + Returns: + Dict with validation hints. Common keys: + - has_multi_rate_tva: bool - Store uses multiple TVA rates + - card_equals_total: bool - CARD payment equals total + - has_client_cui: bool - Receipt includes client CUI + - has_efactura: bool - Store uses e-factura format + - is_non_vat_payer: bool - Store is not a VAT payer + """ + return {} + + # ------------------------------------------------------------------------- + # Helper methods - available to all subclasses + # ------------------------------------------------------------------------- + + @staticmethod + def _normalize_number(text: str) -> str: + """ + Normalize a number string for Decimal conversion. 
+ + Handles Romanian formats: "1.234,56" -> "1234.56" + """ + if not text: + return "0" + + # Remove spaces + text = text.replace(" ", "") + + # Determine decimal separator + last_comma = text.rfind(",") + last_dot = text.rfind(".") + + if last_comma > last_dot: + text = text.replace(".", "").replace(",", ".") + elif last_dot > last_comma: + text = text.replace(",", "") + else: + text = text.replace(",", ".") + + return text + + @staticmethod + def _parse_decimal(text: str) -> Optional[Decimal]: + """Parse a string to Decimal, handling various formats.""" + try: + normalized = BaseStoreProfile._normalize_number(text) + return Decimal(normalized) + except (InvalidOperation, ValueError, TypeError): + return None + + @staticmethod + def _parse_date(text: str) -> Optional[date]: + """ + Parse date string in various formats. + + Supports: DD-MM-YYYY, DD/MM/YYYY, DD.MM.YYYY, YYYY-MM-DD + """ + if not text: + return None + + # Normalize separators + text = text.replace('/', '-').replace('.', '-') + + try: + parts = text.split('-') + if len(parts) != 3: + return None + + # Determine format based on first part length + if len(parts[0]) == 4: + # YYYY-MM-DD + year, month, day = int(parts[0]), int(parts[1]), int(parts[2]) + else: + # DD-MM-YYYY + day, month, year = int(parts[0]), int(parts[1]), int(parts[2]) + + # Validate ranges + if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100: + return date(year, month, day) + except (ValueError, TypeError, IndexError): + pass + + return None + + @staticmethod + def _clean_text(text: str) -> str: + """Clean OCR text for pattern matching.""" + if not text: + return "" + text = re.sub(r'\s+', ' ', text) + text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text) + return text.strip() + + # ------------------------------------------------------------------------- + # Magic methods + # ------------------------------------------------------------------------- + + def __repr__(self) -> str: + return 
f"<{self.__class__.__name__} CUI={self.CUI_LIST}>" + + def __str__(self) -> str: + return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})" diff --git a/backend/modules/data_entry/services/ocr/profiles/best_print.py b/backend/modules/data_entry/services/ocr/profiles/best_print.py new file mode 100644 index 0000000..16cb427 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/best_print.py @@ -0,0 +1,54 @@ +""" +BEST PRINT TRADE ACTIV SRL store profile for OCR extraction. + +Stamp manufacturing service. Non-VAT payer (neplătitor de TVA). +""" + +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class BestPrintProfile(BaseStoreProfile): + """ + BEST PRINT TRADE ACTIV SRL - non-VAT payer profile. + + Key characteristics: + - Non-VAT payer (neplătitor de TVA) - NO TVA on receipts + - Stamp manufacturing and printing services + - Total amount has no TVA component + - CARD payment typical + """ + + CUI_LIST = ["45417955"] + NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"] + STORE_NAME = "BEST PRINT TRADE ACTIV SRL" + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries - returns empty for non-VAT payer. + + BEST PRINT is a non-VAT payer (neplătitor de TVA), + so no TVA entries are expected on receipts. 
+ + Args: + text: Raw OCR text from receipt (unused) + + Returns: + Empty list (non-VAT payer has no TVA) + """ + # Non-VAT payer - no TVA entries + return [] + + def get_validation_hints(self) -> Dict[str, Any]: + """Return BEST PRINT-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": True, + "has_client_cui": True, # May have client CUI + "has_efactura": False, + "is_non_vat_payer": True, # CRITICAL: Non-VAT payer + "tva_pattern": "none", + } diff --git a/backend/modules/data_entry/services/ocr/profiles/brick.py b/backend/modules/data_entry/services/ocr/profiles/brick.py new file mode 100644 index 0000000..468d76a --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/brick.py @@ -0,0 +1,101 @@ +""" +BRICK (Five-Holding) store profile for OCR extraction. + +Five-Holding S.A. operates BRICK stores with standard receipt format. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class BrickProfile(BaseStoreProfile): + """ + FIVE-HOLDING S.A. (BRICK) - standard TVA format. + + Key characteristics: + - Standard TVA format + - Single TVA rate typically + - No client CUI on receipts + """ + + CUI_LIST = ["10562600"] + NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"] # OCR variants + STORE_NAME = "FIVE-HOLDING S.A." + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # Simple: "TVA XX% YY,YY" + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract BRICK-specific TVA entries. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return BRICK-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/dedeman.py b/backend/modules/data_entry/services/ocr/profiles/dedeman.py new file mode 100644 index 0000000..9d6c97e --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/dedeman.py @@ -0,0 +1,118 @@ +""" +DEDEMAN store profile for OCR extraction. + +Dedeman receipts may include e-factura information and use standard TVA format. +Large DIY retailer in Romania. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . 
import ProfileRegistry + + +@ProfileRegistry.register +class DedemanProfile(BaseStoreProfile): + """ + DEDEMAN SRL - standard TVA with e-factura support. + + Key characteristics: + - Standard TVA format + - May include e-factura reference number + - Professional receipts for construction materials + """ + + CUI_LIST = ["2816464"] + NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"] # OCR variants + STORE_NAME = "DEDEMAN SRL" + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA (XX%) YY,YY" + r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)', + ] + + # E-factura pattern for reference extraction + EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)' + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract Dedeman-specific TVA entries. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def extract_efactura_reference(self, text: str) -> str | None: + """ + Extract e-factura reference number if present. + + Args: + text: Raw OCR text from receipt + + Returns: + E-factura reference string or None + """ + match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE) + return match.group(1) if match else None + + def get_validation_hints(self) -> Dict[str, Any]: + """Return Dedeman-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, + "has_client_cui": False, + "has_efactura": True, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/electrobering.py b/backend/modules/data_entry/services/ocr/profiles/electrobering.py new file mode 100644 index 0000000..ec08f59 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/electrobering.py @@ -0,0 +1,102 @@ +""" +ELECTROBERING S.R.L. 
store profile for OCR extraction. + +Electronics and home supplies store. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class ElectroberingProfile(BaseStoreProfile): + """ + ELECTROBERING S.R.L. - standard TVA profile. + + Key characteristics: + - Standard TVA format (single rate, any percentage) + - Electronics and home supplies + - May have client CUI for B2B purchases + - CARD payment typical + """ + + CUI_LIST = ["2744937"] + NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"] + STORE_NAME = "ELECTROBERING S.R.L." + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA XX% YY,YY" (simple format without code) + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return ELECTROBERING-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": True, + "has_client_cui": True, # May have client CUI for B2B + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/gama_ink.py b/backend/modules/data_entry/services/ocr/profiles/gama_ink.py new file mode 100644 index 0000000..8fd4bff --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/gama_ink.py @@ -0,0 +1,103 @@ +""" +GAMA INK SERVICE SRL store profile for OCR extraction. + +Toner refill and printer supplies store. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . 
import ProfileRegistry + + +@ProfileRegistry.register +class GamaInkProfile(BaseStoreProfile): + """ + GAMA INK SERVICE SRL - standard TVA profile. + + Key characteristics: + - Standard TVA format (single rate, any percentage) + - Service-based (toner refill, printer supplies) + - CARD payment typical + """ + + CUI_LIST = ["17741882"] + NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"] + STORE_NAME = "GAMA INK SERVICE SRL" + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA XX% YY,YY" (simple format without code) + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + # "TVA: YY,YY" (amount only, percent inferred) + r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first (have both code and percent) + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format (percent + amount without code) + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return GAMA INK-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": True, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/kineterra.py b/backend/modules/data_entry/services/ocr/profiles/kineterra.py new file mode 100644 index 0000000..1d787e0 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/kineterra.py @@ -0,0 +1,53 @@ +""" +KINETERRA store profile for OCR extraction. + +Kineterra is a non-VAT payer (neplătitor de TVA). +Receipts don't include TVA breakdown. +""" + +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . 
import ProfileRegistry + + +@ProfileRegistry.register +class KineterraProfile(BaseStoreProfile): + """ + KINETERRA CONCEPT SRL - non-VAT payer profile. + + Key characteristics: + - Non-VAT payer (neplătitor de TVA) + - No TVA breakdown on receipts + - Total amount has no TVA component + """ + + CUI_LIST = ["31180432"] + NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"] # OCR variants + STORE_NAME = "KINETERRA CONCEPT SRL" + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries - returns empty for non-VAT payer. + + Kineterra is a non-VAT payer, so no TVA entries are expected. + + Args: + text: Raw OCR text from receipt (unused) + + Returns: + Empty list (non-VAT payer has no TVA) + """ + # Non-VAT payer - no TVA entries + return [] + + def get_validation_hints(self) -> Dict[str, Any]: + """Return Kineterra-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": True, + "tva_pattern": "none", + } diff --git a/backend/modules/data_entry/services/ocr/profiles/lidl.py b/backend/modules/data_entry/services/ocr/profiles/lidl.py new file mode 100644 index 0000000..bcb180b --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/lidl.py @@ -0,0 +1,93 @@ +""" +LIDL store profile for OCR extraction. + +Lidl receipts have a specific TVA format without hyphen/colon separators: + TOTAL TVA 9,84 + TVA A 21,00% 7,71 + TVA B 11,00% 2,13 + +This profile handles multi-rate TVA extraction for Lidl receipts. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class LidlProfile(BaseStoreProfile): + """ + LIDL DISCOUNT S.R.L. - multi-rate TVA profile. 
+ + Key characteristics: + - Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible) + - TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line) + - Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%) + - CARD payment usually equals total + - No client CUI on receipts + """ + + CUI_LIST = ["22891860"] + NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"] # OCR variants + STORE_NAME = "LIDL DISCOUNT S.R.L." + + # Lidl-specific TVA patterns + # Format: "TVA A 21,00% 7,71" (code + percent + amount on same line) + TVA_PATTERNS = [ + # Primary: "TVA A 21,00% 7.71" with various spacing + r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + # With backslash OCR artifact: "TVA A \21,00% 7.71" + r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + # IVA variant (rare OCR misread) + r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract Lidl-specific TVA entries. + + Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts. + Uses deduplication to avoid counting the same entry twice from different patterns. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() # Deduplication key: (code, percent) + + for pattern in self.TVA_PATTERNS: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return Lidl-specific validation hints.""" + return { + "has_multi_rate_tva": True, + "card_equals_total": True, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/omv.py b/backend/modules/data_entry/services/ocr/profiles/omv.py new file mode 100644 index 0000000..9cbff98 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/omv.py @@ -0,0 +1,99 @@ +""" +OMV Petrom store profile for OCR extraction. + +OMV receipts typically include client CUI and use standard TVA format. +Common at gas stations with fuel purchases. + +Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14") +""" + +import re +from datetime import date +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any, Tuple, Optional + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class OMVProfile(BaseStoreProfile): + """ + OMV PETROM MARKETING S.R.L. - standard TVA with client CUI. 
+ + Key characteristics: + - Standard TVA format (usually single rate, any percentage) + - Includes client CUI on receipt (for business purchases) + - TVA table format: "A-XX,XX% base_amount tva_amount" + - Supports historical rates (19%) and current rates (21%) + - Date format: YYYY. MM. DD (with spaces) + """ + + CUI_LIST = ["11201891"] + NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"] # OCR variants + STORE_NAME = "OMV PETROM MARKETING S.R.L." + + # OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva) + TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)' + + # Standard TVA pattern (reserved for a fallback; not used by extract_tva_entries) + TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)' + + # OMV specific: prioritize YYYY. MM. DD format with spaces + DATE_PATTERNS_OCR_SPACES = [ + # YYYY. MM. DD with time (OMV format) + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'), + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'), + # Fallback to DD. MM. YYYY + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'), + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'), + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract OMV-specific TVA entries. + + OMV receipts often show TVA in table format with base and TVA amounts. + Only the table format is parsed; returns an empty list if no table row matches.
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try table format first (more accurate) + for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + # TVA amount is the second number (smaller one) + tva_amount = self._parse_decimal(match.group(4)) + + if tva_amount and tva_amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': tva_amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return OMV-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, + "has_client_cui": True, + "has_efactura": False, + "is_non_vat_payer": False, + "tva_table_format": True, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py b/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py new file mode 100644 index 0000000..cf985f7 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/pictus_velum.py @@ -0,0 +1,101 @@ +""" +PICTUS VELUM SRL store profile for OCR extraction. + +Office supplies and stationery store. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class PictusVelumProfile(BaseStoreProfile): + """ + PICTUS VELUM SRL - standard TVA profile. 
+ + Key characteristics: + - Standard TVA format (single rate, any percentage) + - Office supplies and stationery (rechizite) + - CARD payment typical + """ + + CUI_LIST = ["39634534"] + NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"] + STORE_NAME = "PICTUS VELUM SRL" + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA XX% YY,YY" (simple format without code) + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. + + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return PICTUS VELUM-specific validation hints.""" + return { + "has_multi_rate_tva": 
False, + "card_equals_total": True, + "has_client_cui": False, + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/socar.py b/backend/modules/data_entry/services/ocr/profiles/socar.py new file mode 100644 index 0000000..541c9a4 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/socar.py @@ -0,0 +1,111 @@ +""" +SOCAR Petroleum store profile for OCR extraction. + +SOCAR receipts are similar to OMV - gas station with client CUI support. +Date format may use YYYY. MM. DD with spaces. +""" + +import re +from datetime import date +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any, Tuple, Optional + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class SocarProfile(BaseStoreProfile): + """ + SOCAR PETROLEUM S.A. - standard TVA with client CUI. + + Key characteristics: + - Standard TVA format (usually single rate) + - Includes client CUI on receipt (for business purchases) + - Similar format to OMV/Petrom + - Date format may use YYYY. MM. DD (with spaces) + """ + + CUI_LIST = ["12546600"] + NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"] # OCR variants + STORE_NAME = "SOCAR PETROLEUM S.A." + + # Standard TVA patterns for gas stations + TVA_PATTERNS = [ + # Table format: "A-19,00% 285,66 49,58" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)', + # Simple format: "TVA 19% 49,58" + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + ] + + # Gas stations may use YYYY. MM. DD format + DATE_PATTERNS_OCR_SPACES = [ + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'), + (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'), + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'), + (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'), + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract SOCAR-specific TVA entries. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try table format first + table_pattern = self.TVA_PATTERNS[0] + for match in re.finditer(table_pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + tva_amount = self._parse_decimal(match.group(4)) + + if tva_amount and tva_amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': tva_amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation): + continue + + # Fallback to simple format if no table entries found + if not entries: + simple_pattern = self.TVA_PATTERNS[1] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + # Default to code 'A' for simple format + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break # Only take first match for simple format + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return SOCAR-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, + "has_client_cui": True, + "has_efactura": False, + "is_non_vat_payer": False, + } diff --git a/backend/modules/data_entry/services/ocr/profiles/stepout_market.py b/backend/modules/data_entry/services/ocr/profiles/stepout_market.py new file mode 100644 index 0000000..cda1b52 --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/stepout_market.py @@ -0,0 +1,112 @@ +""" +STEPOUT MARKET SRL store profile for OCR extraction. + +Bookstore with reduced TVA rate (5% for books in Romania). 
+""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class StepoutMarketProfile(BaseStoreProfile): + """ + STEPOUT MARKET SRL - reduced TVA rate profile (books). + + Key characteristics: + - Reduced TVA rate: 5% for books (cărți) in Romania + - May also have standard rates for non-book items + - Patterns are flexible to accept ANY TVA rate + - CARD payment typical + """ + + CUI_LIST = ["35532655"] + NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"] + STORE_NAME = "STEPOUT MARKET SRL" + + # TVA patterns (flexible - accepts any rate including 5%) + TVA_PATTERNS = [ + # "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format) + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - 5,00% = YY,YY" (table format) + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA 5% YY,YY" (simple format - common for single rate) + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + # "TVA 5,00%: YY,YY" (percent with colon) + r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. + + Stepout Market primarily sells books, which have 5% TVA in Romania. + The patterns are generic and will extract whatever rate is on the receipt.
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first (have code letter) + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format (no code letter, just percent + amount) + if not entries: + for pattern in self.TVA_PATTERNS[2:]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + # Default to code 'A' for simple format + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break # Only take first match for simple format + except (ValueError, InvalidOperation): + continue + if entries: + break + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return STEPOUT MARKET-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": True, + "has_client_cui": True, # May have client CUI + "has_efactura": False, + "is_non_vat_payer": False, + "typical_tva_rate": 5, # Books have 5% TVA in Romania + "product_category": "books", + } diff --git a/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py b/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py new file mode 100644 index 0000000..3c629ca --- /dev/null +++ b/backend/modules/data_entry/services/ocr/profiles/unlimited_keys.py @@ -0,0 +1,103 @@ +""" +UNLIMITED KEYS S.R.L. store profile for OCR extraction. + +Key duplication service. 
Notable for CASH (NUMERAR) payments. +""" + +import re +from decimal import Decimal, InvalidOperation +from typing import List, Dict, Any + +from .base import BaseStoreProfile +from . import ProfileRegistry + + +@ProfileRegistry.register +class UnlimitedKeysProfile(BaseStoreProfile): + """ + UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment. + + Key characteristics: + - Standard TVA format (single rate, any percentage) + - Key duplication service + - NUMERAR (cash) payment common - different from most stores! + - May also accept CARD + """ + + CUI_LIST = ["18993187"] + NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"] + STORE_NAME = "UNLIMITED KEYS S.R.L." + + # Standard TVA patterns (flexible - accepts any rate) + TVA_PATTERNS = [ + # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY" + r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)', + # "A - XX,XX% = YY,YY" + r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)', + # "TVA XX% YY,YY" (simple format without code) + r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)', + ] + + def extract_tva_entries(self, text: str) -> List[dict]: + """ + Extract TVA entries from receipt text. 
+ + Args: + text: Raw OCR text from receipt + + Returns: + List of TVA entries with code, percent, and amount + """ + entries = [] + seen = set() + + # Try coded patterns first + for pattern in self.TVA_PATTERNS[:2]: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount = self._parse_decimal(match.group(3)) + + if amount and amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + except (ValueError, InvalidOperation, IndexError): + continue + + # Fallback to simple format + if not entries: + simple_pattern = self.TVA_PATTERNS[2] + for match in re.finditer(simple_pattern, text, re.IGNORECASE): + try: + percent = int(match.group(1)) + amount = self._parse_decimal(match.group(2)) + + if amount and amount > 0: + entries.append({ + 'code': 'A', + 'percent': percent, + 'amount': amount + }) + break + except (ValueError, InvalidOperation): + continue + + return entries + + def get_validation_hints(self) -> Dict[str, Any]: + """Return UNLIMITED KEYS-specific validation hints.""" + return { + "has_multi_rate_tva": False, + "card_equals_total": False, # May be NUMERAR (cash) + "has_client_cui": True, # May have client CUI + "has_efactura": False, + "is_non_vat_payer": False, + "common_payment": "NUMERAR", # Cash payments common + } diff --git a/backend/modules/data_entry/services/ocr_extractor.py b/backend/modules/data_entry/services/ocr_extractor.py index 5a87190..2667c7a 100644 --- a/backend/modules/data_entry/services/ocr_extractor.py +++ b/backend/modules/data_entry/services/ocr_extractor.py @@ -7,6 +7,7 @@ from typing import Optional, Tuple, List from dataclasses import dataclass, field from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine +from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry @dataclass @@ -63,6 +64,57 
@@ class ExtractionResult: class ReceiptExtractor: """Extract receipt fields using pattern matching for Romanian receipts.""" + # ========================================================================= + # DEPRECATED: STORE_PROFILES dict - USE ProfileRegistry INSTEAD + # ========================================================================= + # Store profiles are now managed by ProfileRegistry in: + # backend/modules/data_entry/services/ocr/profiles/ + # + # This dict is kept for reference only. All extraction logic now uses: + # ProfileRegistry.get_profile(cui) + # + # See: backend/modules/data_entry/services/ocr/profiles/README.md + # ========================================================================= + STORE_PROFILES = { + # Lidl - multi-rate TVA (A+B), specific format without hyphen/colon + "22891860": { + "name": "LIDL DISCOUNT S.R.L.", + "tva_pattern": "lidl", + "tva_format": "TVA {code} {percent}% {amount}", + "has_multi_rate_tva": True, + "card_equals_total": True, + }, + # OMV Petrom - single TVA rate, client CUI included + "11201891": { + "name": "OMV PETROM MARKETING S.R.L.", + "tva_pattern": "standard", + "has_client_cui": True, + }, + # FIVE-HOLDING (BRICK) - standard format + "10562600": { + "name": "FIVE-HOLDING S.A.", + "tva_pattern": "standard", + }, + # Dedeman - e-factura format + "2816464": { + "name": "DEDEMAN SRL", + "tva_pattern": "standard", + "has_efactura": True, + }, + # SOCAR Petroleum + "12546600": { + "name": "SOCAR PETROLEUM S.A.", + "tva_pattern": "standard", + "has_client_cui": True, + }, + # Kineterra - non-VAT payer + "31180432": { + "name": "KINETERRA CONCEPT SRL", + "tva_pattern": "none", + "is_non_vat_payer": True, + }, + } + # Total amount patterns (most specific first) # Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc. 
# OCR often produces errors, so patterns must be tolerant @@ -394,48 +446,101 @@ class ReceiptExtractor: result.raw_text = text text_upper = text.upper() - # Extract core fields - result.amount, result.confidence_amount = self._extract_amount(text_upper) - result.receipt_date, result.confidence_date = self._extract_date(text_upper) - result.receipt_number, _ = self._extract_number(text_upper) - result.receipt_series, _ = self._extract_series(text_upper) + # ========================================================================= + # STEP 1: Extract vendor info FIRST to find store profile + # ========================================================================= result.partner_name, result.confidence_vendor = self._extract_vendor(text) result.cui, _ = self._extract_cui(text_upper, text) - # Normalize CUI: fix R0 → RO OCR error and validate format result.cui = OCRValidationEngine.normalize_cui(result.cui) - # Extract additional fields - Multiple TVA entries - result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper) + # Lookup store-specific profile for enhanced extraction accuracy + store_profile = ProfileRegistry.get_profile(result.cui) if result.cui else None + if store_profile: + print(f"[Profile] Using {store_profile.__class__.__name__} for CUI {result.cui}", flush=True) + + # ========================================================================= + # STEP 2: Extract ALL fields using profile (if available) or generic + # ========================================================================= + if store_profile: + # Profile-specific extraction (higher accuracy for known stores) + result.amount, result.confidence_amount = store_profile.extract_total(text_upper) + result.receipt_date, result.confidence_date = store_profile.extract_date(text_upper) + result.receipt_number, _ = store_profile.extract_receipt_number(text_upper) + result.tva_entries = store_profile.extract_tva_entries(text_upper) + result.tva_total = sum(e['amount'] for e in 
result.tva_entries) if result.tva_entries else None + result.payment_methods = store_profile.extract_payment_methods(text_upper) + + # Client data extraction via profile (CUI + name) + profile_client_cui, cui_confidence = store_profile.extract_client_cui(text_upper) + profile_client_name, name_confidence = store_profile.extract_client_name(text) + + if profile_client_cui or profile_client_name: + # Use profile extraction results + result.client_cui = OCRValidationEngine.normalize_cui(profile_client_cui) if profile_client_cui else None + result.client_name = profile_client_name + result.confidence_client = max(cui_confidence, name_confidence) + # Address still via generic (no profile method) + _, _, client_address, _ = self._extract_client_data(text_upper, text) + result.client_address = client_address + else: + # Fallback to generic client extraction + client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text) + result.client_name = client_name + result.client_cui = OCRValidationEngine.normalize_cui(client_cui) + result.client_address = client_address + result.confidence_client = confidence + + print(f"[Profile] Extracted: total={result.amount}, date={result.receipt_date}, " + f"TVA entries={len(result.tva_entries)}, payments={len(result.payment_methods)}", flush=True) + else: + # Generic extraction for unknown stores + result.amount, result.confidence_amount = self._extract_amount(text_upper) + result.receipt_date, result.confidence_date = self._extract_date(text_upper) + result.receipt_number, _ = self._extract_number(text_upper) + result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper) + result.payment_methods = self._extract_payment_methods(text_upper) + + # Generic client extraction + client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text) + result.client_name = client_name + result.client_cui = OCRValidationEngine.normalize_cui(client_cui) + 
result.client_address = client_address + result.confidence_client = confidence + + # Series extraction (no profile method, always generic) + result.receipt_series, _ = self._extract_series(text_upper) + + # ========================================================================= + # STEP 3: Debug logging and validation + # ========================================================================= if not result.tva_entries: print(f"[TVA Debug] No TVA found. Checking patterns...", flush=True) - # Debug: show what patterns see normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper) taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE) rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE) print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True) - # Log TVA vs TOTAL for debugging (validation happens in ocr_service._final_validation) - # NOTE: We NO LONGER clear TVA here - the service will recalculate TOTAL from TVA if needed + # Log TVA vs TOTAL for debugging if result.tva_total and result.amount: if result.tva_total > result.amount: print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True) elif result.tva_total > result.amount * Decimal('0.5'): print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True) + # Additional generic extractions result.items_count = self._extract_items_count(text_upper) result.address = self._extract_address(text_upper) - result.payment_methods = self._extract_payment_methods(text_upper) - # Validate payment methods against extracted amount - # If payment sum >> amount, clear invalid payments (likely OCR error) + # ========================================================================= + # STEP 4: Validate and post-process + # 
========================================================================= # Save original payment methods before validation (for payment mode detection) original_payment_methods = result.payment_methods.copy() if result.payment_methods else [] + # Validate payment methods against extracted amount result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount) # Auto-suggest payment_mode based on detected payment methods - # Use ORIGINAL payment_methods to detect CARD even if validation cleared them - # (e.g., CARD 318.16 is valid even if total validation failed) payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods if payment_methods_for_mode: card_amount = sum( @@ -447,17 +552,9 @@ class ReceiptExtractor: result.suggested_payment_mode = 'banca' print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True) else: - # Only cash payments detected result.suggested_payment_mode = 'numerar' print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True) - # Extract client data (B2B receipts) - client_name, client_cui, client_address, confidence_client = self._extract_client_data(text_upper, text) - result.client_name = client_name - result.client_cui = OCRValidationEngine.normalize_cui(client_cui) # Fix R0 → RO OCR error - result.client_address = client_address - result.confidence_client = confidence_client - # Detect receipt type result.receipt_type = self._detect_receipt_type(text_upper) @@ -620,6 +717,40 @@ class ReceiptExtractor: return num_str + def _calculate_multi_rate_tva_total(self, tva_entries: List[dict]) -> Optional[Decimal]: + """ + Calculate implied total from ALL TVA entries (multi-rate support). 
+ + Formula for each entry: total_for_entry = tva * (100 + rate) / rate + Final total = sum of all entry totals + + Example for Lidl (TVA A 21% = 7.71, TVA B 11% = 2.13): + Entry A: 7.71 * 121 / 21 = 44.42 + Entry B: 2.13 * 111 / 11 = 21.49 + Total: 44.42 + 21.49 = 65.92 ≈ 65.86 (within tolerance) + + Returns: + Implied total Decimal, or None if calculation not possible + """ + if not tva_entries: + return None + + total = Decimal('0') + for entry in tva_entries: + rate = entry.get('percent', 0) + tva_amount = entry.get('amount') + if tva_amount and rate > 0: + try: + tva_dec = Decimal(str(tva_amount)) + # Formula: total_for_entry = tva * (100 + rate) / rate + entry_total = tva_dec * Decimal(100 + rate) / Decimal(rate) + total += entry_total + print(f"[Multi-rate TVA] Entry {entry.get('code', '?')}: tva={tva_amount}, rate={rate}% -> implied={entry_total:.2f}", flush=True) + except (InvalidOperation, ValueError, TypeError): + continue + + return total.quantize(Decimal('0.01')) if total > 0 else None + def _cross_validate_and_calculate_amount( self, amount: Optional[Decimal], @@ -634,12 +765,11 @@ class ReceiptExtractor: Returns: (amount, confidence, source_description) Logic: - 1. If amount is valid (>0) with high confidence (>=0.8), use it directly - 2. Calculate payment_sum = CARD + NUMERAR + other methods - 3. Calculate tva_implied_total = tva_total * (100 + rate) / rate - 4. Cross-validate: if payment_sum matches extracted amount, boost confidence - 5. If amount is 0/None, use payment_sum as total - 6. If payment_sum is 0, try to calculate from TVA + 1. Collect all available sources: extracted amount, payment sum, TVA-implied total + 2. Find consensus: 2+ sources within 3% tolerance + 3. If consensus found, use the higher-confidence source value + 4. If extracted differs >10% from all others, it's an outlier - correct it + 5.
If no consensus possible, fallback to individual validations """ # Calculate payment methods sum payment_sum = Decimal('0') @@ -652,43 +782,73 @@ class ReceiptExtractor: except (InvalidOperation, ValueError, TypeError): continue - # Calculate TVA-implied total: total = tva * (100 + rate) / rate - tva_implied_total = None - if tva_entries: - # Use the main TVA entry (typically the largest or first one) - main_entry = tva_entries[0] - rate = main_entry.get('percent', 19) - tva_amount = main_entry.get('amount') - if tva_amount and rate > 0: - try: - tva_dec = Decimal(str(tva_amount)) - # total = tva * (100 + rate) / rate - tva_implied_total = (tva_dec * Decimal(100 + rate) / Decimal(rate)).quantize(Decimal('0.01')) - except (InvalidOperation, ValueError, TypeError): - pass + # Calculate TVA-implied total using ALL entries (multi-rate fix) + tva_implied_total = self._calculate_multi_rate_tva_total(tva_entries) - # Case 1: Amount is valid with high confidence - validate against TVA and payments + # Multi-source consensus approach (3% tolerance for multi-rate TVA rounding) + CONSENSUS_TOLERANCE = 3.0 # 3% tolerance + + # Collect all available sources with their confidences + sources = [] + if amount and amount > 0: + sources.append(('extracted', float(amount), confidence_amount)) + if payment_sum > 0: + sources.append(('payment', float(payment_sum), 0.92)) # Payment is very reliable + if tva_implied_total and tva_implied_total > 0: + sources.append(('tva_calc', float(tva_implied_total), 0.88)) # TVA calc is reliable + + print(f"[Cross-Validation] Sources: {[(s[0], f'{s[1]:.2f}', f'{s[2]:.2f}') for s in sources]}", flush=True) + + # Find consensus: 2+ sources within tolerance + if len(sources) >= 2: + for i, (name1, val1, conf1) in enumerate(sources): + for name2, val2, conf2 in sources[i+1:]: + if val1 <= 0 or val2 <= 0: + continue + diff_pct = abs(val1 - val2) / max(val1, val2) * 100 + if diff_pct <= CONSENSUS_TOLERANCE: + # Consensus found! 
Use value from higher-confidence source + if conf1 >= conf2: + consensus_val, consensus_conf = val1, conf1 + else: + consensus_val, consensus_conf = val2, conf2 + # Boost confidence for consensus + consensus_conf = min(0.98, consensus_conf + 0.05) + print(f"[Cross-Validation] Consensus: {name1}={val1:.2f} ≈ {name2}={val2:.2f} (diff={diff_pct:.1f}%)", flush=True) + return Decimal(str(round(consensus_val, 2))), consensus_conf, f"consensus ({name1}+{name2})" + + # No consensus - check if extracted is an outlier (differs >10% from all others) + if amount and amount > 0 and len(sources) >= 2: + other_sources = [s for s in sources if s[0] != 'extracted'] + if other_sources: + extracted_val = float(amount) + all_differ = all( + abs(extracted_val - s[1]) / max(extracted_val, s[1]) * 100 > 10 + for s in other_sources if s[1] > 0 + ) + if all_differ: + # Extracted differs significantly from all others - use the best other source + best_other = max(other_sources, key=lambda s: s[2]) + print(f"[Cross-Validation] Extracted outlier: {extracted_val:.2f} differs >10% from all others, using {best_other[0]}={best_other[1]:.2f}", flush=True) + return Decimal(str(round(best_other[1], 2))), best_other[2], f"corrected (extracted outlier, using {best_other[0]})" + + # Fallback: Case 1 - Amount valid with high confidence if amount and amount > 0 and confidence_amount >= 0.8: - # First check TVA-implied total (most reliable when TVA is extracted correctly) + # Check TVA-implied total if tva_implied_total and tva_implied_total > 0: tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100 - if tva_diff_percent <= 1: - # Near-perfect TVA match - highest confidence + if tva_diff_percent <= 3: return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)" elif tva_diff_percent > 10: - # Significant mismatch - TVA-implied total is more reliable - # This catches cases where wrong TOTAL line was extracted (e.g., REST, SUBTOTAL) 
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True) return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)" # Cross-validate with payment methods if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'): - # Perfect match - boost confidence return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)" elif payment_sum > 0: payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100 if payment_diff_percent > 10: - # Significant mismatch - payment sum is more reliable print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True) return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)" @@ -696,29 +856,22 @@ class ReceiptExtractor: # Case 2: Amount exists but low confidence - try to validate/correct if amount and amount > 0: - # First check TVA-implied total (most reliable) if tva_implied_total and tva_implied_total > 0: tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100 - if tva_diff_percent <= 2: - # Close match - boost confidence + if tva_diff_percent <= 3: return amount, 0.88, "extracted (validated by TVA)" elif tva_diff_percent > 10: - # Significant mismatch - use TVA-implied total print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True) return tva_implied_total, 0.85, "calculated from TVA" - # Check if payment methods sum matches if payment_sum > 0: payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100 - if payment_diff_percent <= 0.5: - # Close match - boost confidence + if payment_diff_percent <= 1: return amount, 0.90, "extracted (validated by payment methods)" elif 
payment_diff_percent > 10: - # Mismatch - prefer payment_sum as it's more reliable print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True) return payment_sum, 0.85, "calculated from payment methods" - # No validation possible - return as-is return amount, confidence_amount, "extracted (unvalidated)" # Case 3: Amount is 0 or None - calculate from payment methods @@ -946,6 +1099,28 @@ class ReceiptExtractor: return name + def _get_store_profile(self, cui: Optional[str]) -> Optional[dict]: + """ + Get store-specific profile by CUI. + + DEPRECATED: Use ProfileRegistry.get_profile() directly for profile objects. + This method is kept for backward compatibility and returns validation hints dict. + + Args: + cui: The CUI extracted from receipt (with or without RO prefix) + + Returns: + Store profile validation hints dict or None if not found + """ + profile = ProfileRegistry.get_profile(cui) + if profile: + # Return validation hints for backward compatibility + hints = profile.get_validation_hints() + hints['name'] = profile.STORE_NAME + print(f"[Store Profile] Found profile for {cui}: {profile.STORE_NAME}", flush=True) + return hints + return None + def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]: """ Extract vendor CUI (fiscal identification code) from text. @@ -1020,11 +1195,114 @@ class ReceiptExtractor: # Default to bon_fiscal if neither found return 'bon_fiscal' + def _try_pattern_lidl(self, text: str) -> List[dict]: + """ + Try Lidl-style TVA pattern: "TVA A 21,00% 7.71" (no hyphen/colon separator). + + Lidl receipts format: + TOTAL TVA 9,84 + TVA A 21,00% 7,71 + TVA B 11,00% 2,13 + + Returns list of TVA entries found. 
+ """ + entries = [] + seen = set() + + # Pattern: TVA/TUA/IVA + code (A-D) + percent + amount (on same line) + # Handles: "TVA A 21,00% 7,71", "TVA B 11,00% 2,13", "TUA A 21% 7.71" + lidl_patterns = [ + # Same line: "TVA A 21,00% 7.71" (with various spacing) + r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + # Same line with backslash (OCR artifact): "TVA A \21,00% 7.71" + r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + # IVA variant + r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)', + ] + + for pattern in lidl_patterns: + for match in re.finditer(pattern, text, re.IGNORECASE): + try: + code = match.group(1).upper() + percent = int(match.group(2)) + amount_str = self._normalize_number(match.group(3)) + amount = Decimal(amount_str) + + if amount > 0: + entry_key = (code, percent) + if entry_key not in seen: + entries.append({ + 'code': code, + 'percent': percent, + 'amount': amount + }) + seen.add(entry_key) + print(f"[TVA Lidl] Found: TVA {code} {percent}% = {amount}", flush=True) + except (ValueError, InvalidOperation): + continue + + return entries + + def _select_best_tva_candidate( + self, + candidates: List[tuple], + tva_bon_total: Optional[Decimal] + ) -> Tuple[List[dict], Optional[Decimal]]: + """ + Select the best TVA candidate from collected candidates. + + Selection criteria (priority order): + 1. Sum matches TOTAL TVA BON (highest priority) + 2. More entries = better (for multi-rate receipts) + 3. 
Pattern confidence as tiebreaker + + Args: + candidates: List of (pattern_name, confidence, entries, sum) + tva_bon_total: Authoritative TOTAL TVA BON value (if extracted) + + Returns: + (best_entries, best_sum) + """ + if not candidates: + return [], None + + # Score each candidate + scored = [] + for name, confidence, entries, sum_val in candidates: + score = 0.0 + + # Criterion 1: Sum matches TOTAL TVA BON (highest priority) + if tva_bon_total and sum_val: + tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02')) # 2% tolerance + if abs(sum_val - tva_bon_total) <= tolerance: + score += 100 # High bonus for matching authoritative total + print(f"[TVA Select] {name}: sum {sum_val} matches tva_bon_total {tva_bon_total}", flush=True) + + # Criterion 2: More entries (for multi-rate receipts) + score += len(entries) * 10 + + # Criterion 3: Pattern confidence + score += confidence * 5 + + scored.append((score, name, confidence, entries, sum_val)) + print(f"[TVA Select] Candidate {name}: score={score:.1f}, entries={len(entries)}, sum={sum_val}", flush=True) + + # Sort by score descending + scored.sort(key=lambda x: x[0], reverse=True) + best = scored[0] + print(f"[TVA Select] Winner: {best[1]} (score={best[0]:.1f})", flush=True) + + return best[3], best[4] + def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]: """ Extract multiple TVA (VAT) entries from text. Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%). 
+ Uses CANDIDATE COLLECTION approach: + - Try ALL patterns and collect candidates + - Select best candidate based on matching TOTAL TVA BON + Returns (tva_entries, tva_total) where tva_entries is a list of: {'code': 'A', 'percent': 19, 'amount': Decimal('15.20')} """ @@ -1054,6 +1332,22 @@ class ReceiptExtractor: # Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%") normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text) + # Extract TOTAL TVA BON/TOTAL TVA first as the authoritative reference + tva_bon_total = self._extract_total_tva_bon(normalized_text) + print(f"[TVA Debug] TOTAL TVA BON: {tva_bon_total}", flush=True) + + # CANDIDATE COLLECTION APPROACH: Try all patterns, collect candidates, select best + all_candidates = [] # List of (pattern_name, confidence, entries, sum) + + # === LIDL-STYLE PATTERNS (NEW) === + # Lidl format: "TVA A 21,00% 7.71" or "TVA B 11,00% 2.13" (no hyphen/colon) + # This pattern handles multi-rate TVA receipts + lidl_entries = self._try_pattern_lidl(normalized_text) + if lidl_entries: + lidl_sum = sum(e['amount'] for e in lidl_entries) + all_candidates.append(('lidl', 0.96, lidl_entries, lidl_sum)) + print(f"[TVA Debug] Lidl pattern: {len(lidl_entries)} entries, sum={lidl_sum}", flush=True) + # Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable # Format: "TOTAL TAXE: 55,22" - this is always the TVA amount # OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:" @@ -1372,10 +1666,21 @@ class ReceiptExtractor: except (ValueError, InvalidOperation): continue - # Extract TOTAL TVA BON as reference (separate from individual entries) - tva_bon_total = self._extract_total_tva_bon(normalized_text) + # Add existing extraction results to candidates (if any) + if tva_entries: + entries_sum = sum(entry['amount'] for entry in tva_entries) + all_candidates.append(('standard', 0.90, tva_entries, entries_sum)) + print(f"[TVA Debug] Standard patterns: {len(tva_entries)} 
entries, sum={entries_sum}", flush=True) - # Calculate sum from entries + # === CANDIDATE SELECTION === + # Select best candidate using TOTAL TVA BON as authoritative reference + if all_candidates: + best_entries, best_sum = self._select_best_tva_candidate(all_candidates, tva_bon_total) + if best_entries: + tva_entries = best_entries + entries_sum = best_sum + + # Calculate sum from entries (if not set by candidate selection) entries_sum = None if tva_entries: entries_sum = sum(entry['amount'] for entry in tva_entries) diff --git a/scripts/generate_store_profile.py b/scripts/generate_store_profile.py new file mode 100755 index 0000000..86f37a4 --- /dev/null +++ b/scripts/generate_store_profile.py @@ -0,0 +1,600 @@ +#!/usr/bin/env python3 +""" +Store Profile Generator Script + +Analyzes PDF receipts from a store and generates a Python profile class +for the OCR extraction system. + +Usage: + python scripts/generate_store_profile.py \ + --name "Magazin Exemplu" \ + --cui "12345678" \ + --receipts "docs/data-entry/MagazinExemplu*.pdf" \ + --output "backend/modules/data_entry/services/ocr/profiles/magazin_exemplu.py" + +Features: + - Submits PDFs to OCR API + - Analyzes extracted text for patterns (TVA, total, date, payment) + - Generates a BaseStoreProfile subclass with detected patterns + - Supports hot-reload via ProfileRegistry + +Requirements: + - Backend server running on localhost:8000 + - JWT authentication + - python-jose, requests packages +""" + +import argparse +import glob +import json +import os +import re +import sys +import time +from collections import Counter, defaultdict +from datetime import datetime, timedelta, timezone +from pathlib import Path +from typing import Dict, List, Optional, Tuple + +try: + import requests + from jose import jwt +except ImportError: + print("Error: Required packages not installed.") + print("Run: pip install python-jose requests") + sys.exit(1) + + +# Configuration +API_BASE = os.getenv("API_BASE", 
"http://localhost:8000") +JWT_SECRET = os.getenv("JWT_SECRET_KEY", "GENERATE_NEW_SECRET_FOR_PRODUCTION3334!") + + +def create_jwt_token() -> str: + """Create a test JWT token for API authentication.""" + payload = { + "username": "PROFILE_GENERATOR", + "user_id": 1, + "companies": ["604"], + "permissions": ["read", "write"], + "exp": datetime.now(timezone.utc) + timedelta(hours=1), + "iat": datetime.now(timezone.utc), + "type": "access" + } + return jwt.encode(payload, JWT_SECRET, algorithm="HS256") + + +def submit_ocr(pdf_path: str, token: str, api_base: str = API_BASE, timeout: int = 120) -> Optional[Dict]: + """ + Submit a PDF to OCR API and wait for result. + + Args: + pdf_path: Path to PDF file + token: JWT authentication token + api_base: API base URL + timeout: Max seconds to wait for completion + + Returns: + Extraction result dict or None on failure + """ + headers = {"Authorization": f"Bearer {token}"} + filename = os.path.basename(pdf_path) + + print(f" Submitting: {filename}...", end=" ", flush=True) + + try: + with open(pdf_path, "rb") as f: + files = {"file": (filename, f, "application/pdf")} + response = requests.post( + f"{api_base}/api/data-entry/ocr/extract?engine=doctr_plus", + files=files, + headers=headers, + timeout=30 + ) + + if response.status_code != 200: + print(f"FAILED (HTTP {response.status_code})") + return None + + job_data = response.json() + job_id = job_data.get("job_id") + + if not job_id: + print("FAILED (no job_id)") + return None + + # Poll for completion + start_time = time.time() + while time.time() - start_time < timeout: + poll_response = requests.get( + f"{api_base}/api/data-entry/ocr/jobs/{job_id}/wait?timeout=30", + headers=headers, + timeout=35 + ) + + if poll_response.status_code == 200: + job_result = poll_response.json() + status = job_result.get("status") + + if status == "completed": + elapsed = time.time() - start_time + print(f"OK ({elapsed:.1f}s)") + return job_result.get("result", {}) + elif status == "error": 
+ print(f"ERROR: {job_result.get('error', 'Unknown')}") + return None + + time.sleep(2) + + print("TIMEOUT") + return None + + except Exception as e: + print(f"EXCEPTION: {e}") + return None + + +def analyze_tva_patterns(results: List[Dict]) -> Dict: + """ + Analyze TVA patterns from multiple extraction results. + + Returns: + Dict with detected patterns and statistics + """ + tva_entries = [] + raw_texts = [] + + for r in results: + if r.get("tva_entries"): + tva_entries.extend(r["tva_entries"]) + if r.get("raw_text"): + raw_texts.append(r["raw_text"]) + + # Analyze TVA code patterns (A, B, C, etc.) + codes = Counter(e.get("code") for e in tva_entries if e.get("code")) + + # Analyze TVA percentage patterns + percents = Counter(e.get("percent") for e in tva_entries if e.get("percent")) + + # Detect TVA format from raw text + tva_formats = defaultdict(int) + for text in raw_texts: + text_upper = text.upper() + + # Standard format: "TVA 19% 10.50" or "TVA: 19% 10.50" + if re.search(r'TVA\s*:?\s*\d{1,2}%', text_upper): + tva_formats["standard"] += 1 + + # Lidl format: "TVA A 21% 7.71" + if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper): + tva_formats["lidl_multi_rate"] += 1 + + # Table format: "BAZA TVA | % TVA | VALOARE TVA" + if re.search(r'BAZA\s+TVA', text_upper): + tva_formats["table"] += 1 + + # No TVA (neplatitor) + if re.search(r'NEPLATITOR|NON.?TVA', text_upper): + tva_formats["non_vat"] += 1 + + return { + "codes": dict(codes), + "percents": dict(percents), + "formats": dict(tva_formats), + "has_multi_rate": len(codes) > 1, + "is_non_vat": tva_formats.get("non_vat", 0) > 0, + "dominant_format": max(tva_formats, key=tva_formats.get) if tva_formats else "standard" + } + + +def analyze_total_patterns(results: List[Dict]) -> Dict: + """Analyze TOTAL patterns from extraction results.""" + totals = [] + raw_texts = [] + + for r in results: + if r.get("amount"): + totals.append(float(r["amount"])) + if r.get("raw_text"): + raw_texts.append(r["raw_text"]) + + 
total_formats = defaultdict(int) + for text in raw_texts: + text_upper = text.upper() + + if re.search(r'TOTAL\s*:?\s*[\d.,]+', text_upper): + total_formats["TOTAL:"] += 1 + if re.search(r'TOTAL\s+DE\s+PLAT', text_upper): + total_formats["TOTAL DE PLATA"] += 1 + if re.search(r'SUMA\s+TOTAL', text_upper): + total_formats["SUMA TOTALA"] += 1 + if re.search(r'GRAND\s*TOTAL', text_upper): + total_formats["GRAND TOTAL"] += 1 + + return { + "count": len(totals), + "formats": dict(total_formats), + "dominant_format": max(total_formats, key=total_formats.get) if total_formats else "TOTAL" + } + + +def analyze_date_patterns(results: List[Dict]) -> Dict: + """Analyze date patterns from extraction results.""" + dates = [] + raw_texts = [] + + for r in results: + if r.get("receipt_date"): + dates.append(r["receipt_date"]) + if r.get("raw_text"): + raw_texts.append(r["raw_text"]) + + date_formats = defaultdict(int) + for text in raw_texts: + # DD.MM.YYYY + if re.search(r'\d{2}\.\d{2}\.\d{4}', text): + date_formats["DD.MM.YYYY"] += 1 + # YYYY.MM.DD (OMV/SOCAR style) + if re.search(r'\d{4}\.\d{2}\.\d{2}', text): + date_formats["YYYY.MM.DD"] += 1 + # DD-MM-YYYY + if re.search(r'\d{2}-\d{2}-\d{4}', text): + date_formats["DD-MM-YYYY"] += 1 + # DD/MM/YYYY + if re.search(r'\d{2}/\d{2}/\d{4}', text): + date_formats["DD/MM/YYYY"] += 1 + + return { + "extracted_dates": dates, + "formats": dict(date_formats), + "dominant_format": max(date_formats, key=date_formats.get) if date_formats else "DD.MM.YYYY" + } + + +def analyze_payment_patterns(results: List[Dict]) -> Dict: + """Analyze payment method patterns.""" + payment_counts = defaultdict(int) + + for r in results: + methods = r.get("payment_methods", []) + for m in methods: + method_type = m.get("method", "UNKNOWN") + payment_counts[method_type] += 1 + + return { + "methods": dict(payment_counts), + "has_mixed_payments": len(payment_counts) > 1 + } + + +def analyze_client_patterns(results: List[Dict]) -> Dict: + """Analyze client (B2B) 
patterns.""" + has_client_cui = 0 + has_client_name = 0 + + for r in results: + if r.get("client_cui"): + has_client_cui += 1 + if r.get("client_name"): + has_client_name += 1 + + return { + "has_client_cui": has_client_cui > 0, + "has_client_name": has_client_name > 0, + "b2b_ratio": has_client_cui / len(results) if results else 0 + } + + +def generate_profile_code( + store_name: str, + cui: str, + tva_analysis: Dict, + total_analysis: Dict, + date_analysis: Dict, + payment_analysis: Dict, + client_analysis: Dict +) -> str: + """ + Generate Python profile class code. + + Args: + store_name: Human-readable store name + cui: CUI number (without RO prefix) + *_analysis: Analysis results from pattern detection + + Returns: + Python source code for the profile class + """ + # Generate class name from store name + class_name = "".join( + word.capitalize() + for word in re.sub(r'[^a-zA-Z0-9\s]', '', store_name).split() + ) + "Profile" + + # Generate module name + module_name = re.sub(r'[^a-z0-9]', '_', store_name.lower()).strip('_') + + # Determine profile characteristics + is_non_vat = tva_analysis.get("is_non_vat", False) + has_multi_rate = tva_analysis.get("has_multi_rate", False) + has_client_cui = client_analysis.get("has_client_cui", False) + uses_yyyy_mm_dd = date_analysis.get("dominant_format") == "YYYY.MM.DD" + + # Generate OCR name patterns + name_words = store_name.upper().split() + primary_word = name_words[0] if name_words else store_name.upper() + name_patterns = [ + primary_word, + store_name.upper().replace(".", "").replace(",", ""), + ] + # Add OCR error variants + ocr_variants = { + 'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3' + } + for char, replacement in ocr_variants.items(): + if char in primary_word: + name_patterns.append(primary_word.replace(char, replacement, 1)) + + name_patterns = list(dict.fromkeys(name_patterns))[:4] # Unique, max 4 + + # Build the code + code_lines = [ + '"""', + f'{store_name} store profile for OCR 
extraction.', + '', + 'Auto-generated by generate_store_profile.py', + f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}', + '"""', + '', + 'import re', + 'from decimal import Decimal, InvalidOperation', + 'from typing import List, Dict, Any', + '', + 'from .base import BaseStoreProfile', + 'from . import ProfileRegistry', + '', + '', + '@ProfileRegistry.register', + f'class {class_name}(BaseStoreProfile):', + ' """', + f' {store_name} - OCR extraction profile.', + ' ', + ] + + # Add characteristics to docstring + characteristics = [] + if is_non_vat: + characteristics.append("Non-VAT payer (neplatitor TVA)") + if has_multi_rate: + characteristics.append("Multi-rate TVA") + if has_client_cui: + characteristics.append("B2B receipts with client CUI") + if uses_yyyy_mm_dd: + characteristics.append("Date format: YYYY.MM.DD") + + if characteristics: + code_lines.append(' Key characteristics:') + for c in characteristics: + code_lines.append(f' - {c}') + code_lines.append(' ') + + code_lines.extend([ + ' """', + '', + f' CUI_LIST = ["{cui}"]', + f' NAME_PATTERNS = {name_patterns}', + f' STORE_NAME = "{store_name}"', + '', + ]) + + # Add date patterns override for YYYY.MM.DD format + if uses_yyyy_mm_dd: + code_lines.extend([ + ' # Override date patterns for YYYY.MM.DD format', + ' DATE_PATTERNS_OCR_SPACES = [', + ' r\'(\\d{4})[.,]\\s*(\\d{2})[.,]\\s*(\\d{2})\', # YYYY. MM. 
DD with spaces', + ' r\'(\\d{4})[.,](\\d{2})[.,](\\d{2})\', # YYYY.MM.DD', + ' ]', + '', + ]) + + # Add TVA extraction method for multi-rate or non-VAT + if is_non_vat: + code_lines.extend([ + ' def extract_tva_entries(self, text: str) -> List[dict]:', + ' """Non-VAT payer - returns empty list."""', + ' return []', + '', + ]) + elif has_multi_rate and tva_analysis.get("dominant_format") == "lidl_multi_rate": + code_lines.extend([ + ' # Store-specific TVA patterns', + ' TVA_PATTERNS = [', + ' r\'T[VU][AR]\\s+([A-D])\\s+(\\d{1,2})[.,]?\\d{0,2}\\s*%\\s+([\\d.,]+)\',', + ' ]', + '', + ' def extract_tva_entries(self, text: str) -> List[dict]:', + ' """Extract multi-rate TVA entries."""', + ' entries = []', + ' seen = set()', + '', + ' for pattern in self.TVA_PATTERNS:', + ' for match in re.finditer(pattern, text, re.IGNORECASE):', + ' try:', + ' code = match.group(1).upper()', + ' percent = int(match.group(2))', + ' amount = self._parse_decimal(match.group(3))', + '', + ' if amount and amount > 0:', + ' entry_key = (code, percent)', + ' if entry_key not in seen:', + ' entries.append({', + ' \'code\': code,', + ' \'percent\': percent,', + ' \'amount\': amount', + ' })', + ' seen.add(entry_key)', + ' except (ValueError, InvalidOperation):', + ' continue', + '', + ' return entries', + '', + ]) + + # Add validation hints method + code_lines.extend([ + ' def get_validation_hints(self) -> Dict[str, Any]:', + f' """Return {store_name}-specific validation hints."""', + ' return {', + f' "has_multi_rate_tva": {has_multi_rate},', + f' "card_equals_total": True,', + f' "has_client_cui": {has_client_cui},', + f' "has_efactura": False,', + f' "is_non_vat_payer": {is_non_vat},', + ' }', + ]) + + return '\n'.join(code_lines) + '\n' + + +def main(): + parser = argparse.ArgumentParser( + description="Generate store profile from PDF receipts", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Generate profile from a single PDF + python 
scripts/generate_store_profile.py \\ + --name "Magazin Nou" --cui "12345678" \\ + --receipts "docs/data-entry/magazin_nou.pdf" + + # Generate profile from multiple PDFs (glob pattern) + python scripts/generate_store_profile.py \\ + --name "Carrefour" --cui "2475489" \\ + --receipts "docs/data-entry/Carrefour*.pdf" \\ + --output backend/modules/data_entry/services/ocr/profiles/carrefour.py + + # Dry run (analyze only, don't write file) + python scripts/generate_store_profile.py \\ + --name "Test Store" --cui "11111111" \\ + --receipts "docs/data-entry/test*.pdf" \\ + --dry-run + """ + ) + + parser.add_argument("--name", required=True, help="Store name (e.g., 'LIDL DISCOUNT S.R.L.')") + parser.add_argument("--cui", required=True, help="CUI number without RO prefix") + parser.add_argument("--receipts", required=True, help="PDF file path or glob pattern") + parser.add_argument("--output", help="Output file path (default: auto-generated)") + parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't write file") + parser.add_argument("--api-base", default=API_BASE, help=f"API base URL (default: {API_BASE})") + + args = parser.parse_args() + + # Update API base if provided + api_base = args.api_base + + # Validate CUI format + cui = args.cui.strip().replace("RO", "").replace(" ", "") + if not cui.isdigit() or len(cui) < 6 or len(cui) > 10: + print(f"Error: Invalid CUI format: {args.cui}") + sys.exit(1) + + # Find PDF files + pdf_files = glob.glob(args.receipts) + if not pdf_files: + print(f"Error: No PDF files found matching: {args.receipts}") + sys.exit(1) + + print(f"\n{'='*60}") + print(f"Store Profile Generator") + print(f"{'='*60}") + print(f"Store: {args.name}") + print(f"CUI: {cui}") + print(f"PDFs: {len(pdf_files)} files") + print(f"{'='*60}\n") + + # Generate JWT token + token = create_jwt_token() + + # Submit PDFs to OCR + print("Step 1: Submitting PDFs to OCR API...") + results = [] + for pdf_path in pdf_files: + result = 
submit_ocr(pdf_path, token, api_base=api_base) + if result: + results.append(result) + + if not results: + print("\nError: No successful extractions. Check if backend is running.") + sys.exit(1) + + print(f"\nSuccessfully extracted: {len(results)}/{len(pdf_files)} PDFs") + + # Analyze patterns + print("\nStep 2: Analyzing patterns...") + tva_analysis = analyze_tva_patterns(results) + total_analysis = analyze_total_patterns(results) + date_analysis = analyze_date_patterns(results) + payment_analysis = analyze_payment_patterns(results) + client_analysis = analyze_client_patterns(results) + + print(f" TVA: {tva_analysis['dominant_format']} format, multi-rate={tva_analysis['has_multi_rate']}") + print(f" Date: {date_analysis['dominant_format']} format") + print(f" Payments: {list(payment_analysis['methods'].keys())}") + print(f" B2B: {client_analysis['has_client_cui']}") + + # Generate profile code + print("\nStep 3: Generating profile code...") + code = generate_profile_code( + store_name=args.name, + cui=cui, + tva_analysis=tva_analysis, + total_analysis=total_analysis, + date_analysis=date_analysis, + payment_analysis=payment_analysis, + client_analysis=client_analysis + ) + + # Determine output path + if args.output: + output_path = args.output + else: + module_name = re.sub(r'[^a-z0-9]', '_', args.name.lower()).strip('_') + output_path = f"backend/modules/data_entry/services/ocr/profiles/{module_name}.py" + + if args.dry_run: + print(f"\n[DRY RUN] Would write to: {output_path}") + print(f"\n{'='*60}") + print("Generated code:") + print(f"{'='*60}") + print(code) + else: + # Write file (skip makedirs when the path has no directory part, e.g. --output profile.py) + output_dir = os.path.dirname(output_path) + if output_dir: + os.makedirs(output_dir, exist_ok=True) + # Store names may contain diacritics, so write UTF-8 explicitly + with open(output_path, 'w', encoding='utf-8') as f: + f.write(code) + print(f" Written to: {output_path}") + + # Verify syntax + import py_compile + try: + py_compile.compile(output_path, doraise=True) + print(f" Syntax check: OK") + except py_compile.PyCompileError as e: + print(f" Syntax check: FAILED - {e}") + + print(f"\n{'='*60}") +
print("Profile generation complete!") + print(f"{'='*60}") + + if not args.dry_run: + print(f"\nNext steps:") + print(f"1. Review the generated code: {output_path}") + print(f"2. Customize patterns if needed") + print(f"3. Hot-reload profiles: curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload") + print(f"4. Test with a sample receipt") + + +if __name__ == "__main__": + main()
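
The two ideas this change introduces in `ReceiptExtractor` (summing the implied total over every TVA rate, then accepting a value once two sources agree within a tolerance) can be tried in isolation with a minimal sketch like the one below. The names `implied_total` and `find_consensus` are hypothetical and exist only for this illustration; the 3% default mirrors `CONSENSUS_TOLERANCE` in the diff, and the sketch keeps everything in `Decimal` rather than converting to `float` as the diff does, which is an assumption made here for simplicity.

```python
from decimal import Decimal
from typing import List, Optional, Tuple

# (source name, value, confidence) - hypothetical alias for this sketch
Source = Tuple[str, Decimal, float]


def implied_total(tva_entries: List[dict]) -> Optional[Decimal]:
    """Sum tva * (100 + rate) / rate over every entry, rounding once at the end."""
    total = Decimal("0")
    for entry in tva_entries:
        rate = entry.get("percent", 0)
        amount = entry.get("amount")
        if amount and rate > 0:
            total += Decimal(str(amount)) * Decimal(100 + rate) / Decimal(rate)
    return total.quantize(Decimal("0.01")) if total > 0 else None


def find_consensus(
    sources: List[Source], tolerance_pct: float = 3.0
) -> Optional[Tuple[Decimal, str]]:
    """Return the higher-confidence value of the first pair agreeing within tolerance."""
    limit = Decimal(str(tolerance_pct))
    for i, (name1, val1, conf1) in enumerate(sources):
        for name2, val2, conf2 in sources[i + 1:]:
            if val1 <= 0 or val2 <= 0:
                continue  # skip unusable sources, as the diff does
            diff_pct = abs(val1 - val2) / max(val1, val2) * 100
            if diff_pct <= limit:
                value = val1 if conf1 >= conf2 else val2
                return value, f"consensus ({name1}+{name2})"
    return None


# Lidl example from the docstring: TVA A 21% = 7.71, TVA B 11% = 2.13
entries = [
    {"code": "A", "percent": 21, "amount": Decimal("7.71")},
    {"code": "B", "percent": 11, "amount": Decimal("2.13")},
]
tva_total = implied_total(entries)
result = find_consensus([
    ("extracted", Decimal("65.86"), 0.70),
    ("tva_calc", tva_total, 0.88),
])
print(tva_total, result)
```

With these inputs the implied total is 65.92, which is about 0.1% away from the extracted 65.86, so the pair counts as a consensus and the higher-confidence `tva_calc` value is returned. Using only the first TVA entry, as the removed code did, would give 44.42 and flag the extracted amount as a mismatch.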