feat(ocr): Add modular store profiles with hot-reload support
## Store Profiles System
- Add ProfileRegistry for CUI-based profile lookup
- Add BaseStoreProfile with generic extraction patterns
- Implement hot-reload via POST /api/data-entry/ocr/profiles/reload
## 12 Store Profiles
- LIDL: Multi-rate TVA (A, B, C, D codes)
- OMV, SOCAR: B2B with client CUI, YYYY.MM.DD dates
- BRICK, DEDEMAN: Standard TVA, e-factura support
- KINETERRA, BEST PRINT: Non-VAT payers (returns [])
- STEPOUT MARKET: TVA 5% (books/reduced rate)
- UNLIMITED KEYS: NUMERAR payment detection
- GAMA INK, ELECTROBERING, PICTUS VELUM: Standard TVA
## Flexible TVA Patterns
- All patterns use (\d{1,2})% to accept any rate
- Supports historical (19%, 9%, 5%) and current (21%, 11%)
## Payment Methods Fix
- Fixed base.py to support multiple payments of same type
- Changed deduplication from method-only to (method, amount) tuple
- Returns separate entries for split payments
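
A minimal sketch of the deduplication change described above, assuming each payment is a dict with `method` and `amount` keys (the real structure in `base.py` is not shown in this commit page, and `dedupe_payments` is a hypothetical name):

```python
from decimal import Decimal
from typing import Dict, List


def dedupe_payments(payments: List[Dict]) -> List[Dict]:
    """Drop repeated (method, amount) pairs while keeping split payments."""
    seen = set()
    unique = []
    for payment in payments:
        # Keying on method alone would merge two CARD payments into one.
        key = (payment["method"], payment["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(payment)
    return unique


payments = [
    {"method": "CARD", "amount": Decimal("50.00")},
    {"method": "CARD", "amount": Decimal("25.50")},  # split payment, kept
    {"method": "CARD", "amount": Decimal("50.00")},  # OCR repeat, dropped
]
result = dedupe_payments(payments)
```

One trade-off of this key: two genuine payments with the same method and the same amount would still collapse into one entry.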
## Tools
- Add generate_store_profile.py for automatic profile generation
- Analyzes PDFs via OCR API and detects patterns
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
106
.claude/rules/claude-learn-backend.md
Normal file
@@ -0,0 +1,106 @@
# Claude Learn: Backend

**Domain**: backend
**Last updated**: 2026-01-06
**Sessions recorded**: 1

Knowledge about FastAPI, Python services, Oracle DB, and backend architecture.

---

## Patterns

### ProfileRegistry with Hot-Reload for Store Profiles
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
**Description**: A registration system for OCR profiles that uses the `@ProfileRegistry.register` decorator, with hot-reload via `importlib.reload()`. Profiles can be added or modified without restarting the server.

**Example** (`backend/modules/data_entry/services/ocr/profiles/__init__.py`):
```python
class ProfileRegistry:
    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
    _instances: Dict[str, "BaseStoreProfile"] = {}

    @classmethod
    def register(cls, profile_class):
        """Decorator to register a store profile class."""
        for cui in profile_class.CUI_LIST:
            cls._profiles[cls._normalize_cui(cui)] = profile_class
        return profile_class

    @classmethod
    def reload_all(cls):
        """Hot-reload all profile modules via importlib.reload()."""
        cls._instances.clear()
        for module_name in cls._get_profile_module_names():
            importlib.reload(sys.modules[f"backend...profiles.{module_name}"])
```

**Usage**:
```python
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
    CUI_LIST = ["22891860"]
    STORE_NAME = "LIDL DISCOUNT S.R.L."

# Lookup
profile = ProfileRegistry.get_profile("22891860")

# Hot-reload (HTTP endpoint, not Python):
# POST /api/data-entry/ocr/profiles/reload
```

**Tags**: registry-pattern, hot-reload, decorator, ocr, singleton

---

### Script That Generates Python Code from PDF Analysis
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
**Description**: A script that analyzes PDFs via the OCR API, detects patterns (TVA format, date format, payment), and automatically generates Python code for new profiles. It includes JWT auth, async polling, and a syntax check.

**Example** (`scripts/generate_store_profile.py`):
```python
import re
from collections import defaultdict
from typing import Dict, List


def analyze_tva_patterns(results: List[Dict]) -> Dict:
    """Detect the dominant TVA format in the OCR results."""
    tva_formats = defaultdict(int)
    # Assumption: each result dict carries the OCR text under "raw_text".
    raw_texts = [result.get("raw_text", "") for result in results]
    for text in raw_texts:
        text_upper = text.upper()
        if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
            tva_formats["lidl_multi_rate"] += 1
        if re.search(r'BAZA\s+TVA', text_upper):
            tva_formats["table"] += 1
    # Guard against an empty tally; max() on an empty dict raises ValueError.
    dominant = max(tva_formats, key=tva_formats.get) if tva_formats else None
    return {"dominant_format": dominant}


def generate_profile_code(store_name, cui, tva_analysis, ...):
    """Generate Python code for the profile class."""
    # Template-based generation with OCR error variants
```

**Usage**:
```bash
# Dry-run for preview
python scripts/generate_store_profile.py \
  --name "Magazin Nou" --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" --dry-run

# Generate and save
python scripts/generate_store_profile.py \
  --name "Magazin Nou" --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" \
  --output backend/.../profiles/magazin_nou.py
```

**Tags**: code-generation, ocr, automation, cli-tool

---

## Gotchas

_(None recorded yet)_

---

## Statistics

- **Total Patterns**: 2
- **Total Gotchas**: 0
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1
28
.claude/rules/claude-learn-database.md
Normal file
@@ -0,0 +1,28 @@
# Claude Learn: Database

**Domain**: database
**Last updated**: -
**Sessions recorded**: 0

Knowledge about Oracle DB, SQLite, SQLModel, migrations, and data modeling.

---

## Patterns

_(None recorded yet)_

---

## Gotchas

_(None recorded yet)_

---

## Statistics

- **Total Patterns**: 0
- **Total Gotchas**: 0
- **Last Session**: -
- **Sessions Recorded**: 0
55
.claude/rules/claude-learn-deployment.md
Normal file
@@ -0,0 +1,55 @@
# Claude Learn: Deployment

**Domain**: deployment
**Last updated**: 2026-01-06
**Sessions recorded**: 1

Knowledge about IIS, Docker, deployment scripts, and infrastructure.

---

## Patterns

### IIS URL Rewrite Rules for SPA with Multiple API Backends
**Discovered**: 2025-12-22 (feature: unified-app)
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.

**Example** (`public/web.config:5-28`):
```xml
<rewrite>
  <rules>
    <rule name="Proxy Reports API" stopProcessing="true">
      <match url="^api/reports/(.*)" />
      <action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
    </rule>
    <rule name="Proxy Data Entry API" stopProcessing="true">
      <match url="^api/data-entry/(.*)" />
      <action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
    </rule>
    <rule name="SPA Fallback" stopProcessing="true">
      <match url=".*" />
      <conditions logicalGrouping="MatchAll">
        <add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
      </conditions>
      <action type="Rewrite" url="/index.html" />
    </rule>
  </rules>
</rewrite>
```

**Tags**: iis, deployment, spa, microservices, proxy
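
To see what these three rules do to a request path, the first-match logic can be simulated in a few lines of Python. This is only an illustrative model, not IIS itself: `PROXY_RULES` and `rewrite` are hypothetical names, and the `is_existing_file` flag stands in for the `{REQUEST_FILENAME}` / `IsFile` condition.

```python
import re

# Ordered (pattern, target) pairs mirroring the two proxy rules above.
# IIS matches the URL path without its leading slash and ignores case by default.
PROXY_RULES = [
    (re.compile(r"^api/reports/(.*)", re.IGNORECASE), "http://localhost:8001/api/{R:1}"),
    (re.compile(r"^api/data-entry/(.*)", re.IGNORECASE), "http://localhost:8003/api/{R:1}"),
]


def rewrite(path: str, is_existing_file: bool = False) -> str:
    """Return the rewritten URL for a request path (first matching rule wins)."""
    for pattern, target in PROXY_RULES:
        match = pattern.match(path)
        if match:
            # {R:1} is the first capture group of the matched rule.
            return target.replace("{R:1}", match.group(1))
    # SPA fallback: real files are served as-is, everything else gets index.html.
    return "/" + path if is_existing_file else "/index.html"
```

Note that `{R:1}` carries only the captured remainder, so with the rules as written the `reports/` or `data-entry/` segment is not forwarded to the backend.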

---

## Gotchas

_(None recorded yet)_

---

## Statistics

- **Total Patterns**: 1
- **Total Gotchas**: 0
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1
83
.claude/rules/claude-learn-domains.md
Normal file
@@ -0,0 +1,83 @@
# Claude Learn Domains Configuration

**Last updated**: 2026-01-06

This file defines available knowledge domains and their file path patterns.

---

## Domains

### frontend
**File**: `claude-learn-frontend.md`
**Patterns**:
- `src/**/*.vue`
- `src/**/*.js`
- `src/**/*.ts`
- `src/**/*.css`
- `vite.config.*`
- `package.json`

---

### backend
**File**: `claude-learn-backend.md`
**Patterns**:
- `backend/**/*.py`
- `backend/modules/**/*`
- `requirements.txt`

---

### database
**File**: `claude-learn-database.md`
**Patterns**:
- `**/*.sql`
- `**/models.py`
- `**/schemas.py`
- `backend/**/db/**/*`

---

### testing
**File**: `claude-learn-testing.md`
**Patterns**:
- `tests/**/*`
- `**/*.test.*`
- `**/*.spec.*`
- `pytest.ini`
- `vitest.config.*`

---

### deployment
**File**: `claude-learn-deployment.md`
**Patterns**:
- `deployment/**/*`
- `public/web.config`
- `Dockerfile*`
- `docker-compose*.yml`
- `*.sh`
- `ansible/**/*`

---

### global
**File**: `claude-learn-global.md`
**Patterns**:
- `*` (catch-all for cross-cutting concerns)

---

## Statistics

| Domain | Patterns | Gotchas | Last Updated |
|--------|----------|---------|--------------|
| frontend | 8 | 10 | 2026-01-06 |
| deployment | 1 | 0 | 2026-01-06 |
| global | 0 | 1 | 2026-01-06 |
| backend | 2 | 0 | 2026-01-06 |
| database | 0 | 0 | - |
| testing | 0 | 0 | - |

**Total**: 11 patterns, 11 gotchas across 4 domains
@@ -1,9 +1,10 @@
|
||||
# Learned Patterns & Gotchas
|
||||
# Claude Learn: Frontend
|
||||
|
||||
**Last updated**: 2025-12-24
|
||||
**Maintained**: Manually (add new patterns/gotchas as discovered)
|
||||
**Domain**: frontend
|
||||
**Last updated**: 2026-01-06
|
||||
**Sessions recorded**: 3
|
||||
|
||||
This file contains insights learned during feature implementations. Claude Code auto-loads this file to prevent repeating past mistakes.
|
||||
Knowledge about Vue.js, Vite, Pinia, CSS, and frontend architecture.
|
||||
|
||||
---
|
||||
|
||||
@@ -130,37 +131,6 @@ resolve: {
|
||||
|
||||
---
|
||||
|
||||
### IIS URL Rewrite Rules for SPA with Multiple API Backends
|
||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
|
||||
|
||||
**Example** (`public/web.config:5-28`):
|
||||
```xml
|
||||
<rewrite>
|
||||
<rules>
|
||||
<rule name="Proxy Reports API" stopProcessing="true">
|
||||
<match url="^api/reports/(.*)" />
|
||||
<action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
|
||||
</rule>
|
||||
<rule name="Proxy Data Entry API" stopProcessing="true">
|
||||
<match url="^api/data-entry/(.*)" />
|
||||
<action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
|
||||
</rule>
|
||||
<rule name="SPA Fallback" stopProcessing="true">
|
||||
<match url=".*" />
|
||||
<conditions logicalGrouping="MatchAll">
|
||||
<add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
|
||||
</conditions>
|
||||
<action type="Rewrite" url="/index.html" />
|
||||
</rule>
|
||||
</rules>
|
||||
</rewrite>
|
||||
```
|
||||
|
||||
**Tags**: iis, deployment, spa, microservices, proxy
|
||||
|
||||
---
|
||||
|
||||
### Vue Watcher for Auto-Loading Dependent Data
|
||||
**Discovered**: 2025-12-24 (feature: unified-app-ux)
|
||||
**Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention.
|
||||
@@ -248,15 +218,6 @@ const getStorageKey = () => {
|
||||
|
||||
---
|
||||
|
||||
### Sed Command Quote Mismatch in Bulk Find-Replace
|
||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
|
||||
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
|
||||
|
||||
**Tags**: sed, regex, scripting, find-replace, migration
|
||||
|
||||
---
|
||||
|
||||
### Circular Reference in API Wrapper
|
||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||
**Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name.
|
||||
@@ -287,7 +248,7 @@ const getStorageKey = () => {
|
||||
### Vite Build Transform Count is Progress Indicator
|
||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||
**Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration.
|
||||
**Solution**: Watch the 'transforming... ✓ N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200→573→1490→1492 modules meant we were getting close to success. Use this as encouragement!
|
||||
**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200->573->1490->1492 modules meant we were getting close to success. Use this as encouragement!
|
||||
|
||||
**Tags**: vite, build, debugging, progress-tracking, developer-experience
|
||||
|
||||
@@ -329,9 +290,9 @@ const getStorageKey = () => {
|
||||
|
||||
---
|
||||
|
||||
## Memory Statistics
|
||||
## Statistics
|
||||
|
||||
- **Total Patterns**: 9
|
||||
- **Total Gotchas**: 11
|
||||
- **Last Session**: 2025-12-24 (unified-app-ux)
|
||||
- **Sessions Recorded**: 2
|
||||
- **Total Patterns**: 8
|
||||
- **Total Gotchas**: 10
|
||||
- **Last Session**: 2026-01-06
|
||||
- **Sessions Recorded**: 3
|
||||
33
.claude/rules/claude-learn-global.md
Normal file
@@ -0,0 +1,33 @@
# Claude Learn: Global

**Domain**: global
**Last updated**: 2026-01-06
**Sessions recorded**: 1

Cross-cutting knowledge applicable to multiple domains (scripting, tooling, workflow).

---

## Patterns

_(None recorded yet)_

---

## Gotchas

### Sed Command Quote Mismatch in Bulk Find-Replace
**Discovered**: 2025-12-22 (feature: unified-app)
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.

**Tags**: sed, regex, scripting, find-replace, migration

---

## Statistics

- **Total Patterns**: 0
- **Total Gotchas**: 1
- **Last Session**: 2026-01-06
- **Sessions Recorded**: 1
28
.claude/rules/claude-learn-testing.md
Normal file
@@ -0,0 +1,28 @@
# Claude Learn: Testing

**Domain**: testing
**Last updated**: -
**Sessions recorded**: 0

Knowledge about pytest, Vitest, test patterns, and validation strategies.

---

## Patterns

_(None recorded yet)_

---

## Gotchas

_(None recorded yet)_

---

## Statistics

- **Total Patterns**: 0
- **Total Gotchas**: 0
- **Last Session**: -
- **Sessions Recorded**: 0
@@ -628,3 +628,86 @@ def _dict_to_extraction_data(data: dict) -> ExtractionData:
        validation_errors=data.get('validation_errors', []),
        inter_ocr_ratios=data.get('inter_ocr_ratios', {}),
    )


# ============================================================================
# Store Profiles Management Endpoints
# ============================================================================

@router.post("/profiles/reload")
async def reload_store_profiles(
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    Hot-reload all store profiles.

    Reloads profile Python modules without server restart.
    Use after adding/modifying profile files.

    Returns:
        Dict with reloaded count and profile list
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    count = ProfileRegistry.reload_all()
    status = ProfileRegistry.get_reload_status()

    return {
        "success": True,
        "reloaded_modules": count,
        "profiles_count": status["profiles_count"],
        "registered_cuis": status["registered_cuis"],
        "last_reload": status["last_reload"],
    }


@router.get("/profiles")
async def list_store_profiles(
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    List all registered store profiles.

    Returns:
        Dict with profiles list and status
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    profiles = ProfileRegistry.list_profiles()
    status = ProfileRegistry.get_reload_status()

    return {
        "profiles": profiles,
        "count": len(profiles),
        "last_reload": status["last_reload"],
    }


@router.get("/profiles/{cui}")
async def get_store_profile(
    cui: str,
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    Get details for a specific store profile.

    Args:
        cui: Store CUI (with or without RO prefix)

    Returns:
        Profile details including validation hints

    Raises:
        404: If no profile exists for this CUI
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    info = ProfileRegistry.get_profile_info(cui)

    if not info:
        raise HTTPException(
            status_code=404,
            detail=f"No profile registered for CUI: {cui}"
        )

    return info
258
backend/modules/data_entry/services/ocr/profiles/README.md
Normal file
@@ -0,0 +1,258 @@
# Store Profiles - OCR Extraction

Store-specific profile system for OCR extraction, with hot-reload.

---

## Quick Start: Add a new profile

```bash
# 1. Generate a profile from PDFs (dry-run for preview)
python scripts/generate_store_profile.py \
  --name "Magazin Nou SRL" \
  --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" \
  --dry-run

# 2. Generate and save
python scripts/generate_store_profile.py \
  --name "Magazin Nou SRL" \
  --cui "12345678" \
  --receipts "docs/data-entry/MagazinNou*.pdf" \
  --output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py

# 3. Hot-reload (no server restart); the endpoint requires authentication
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
  -H "Authorization: Bearer <token>"

# 4. Verify
curl http://localhost:8000/api/data-entry/ocr/profiles \
  -H "Authorization: Bearer <token>"
```

---

## Directory structure

```
profiles/
├── __init__.py          # ProfileRegistry + hot-reload (~390 lines)
├── base.py              # BaseStoreProfile + generic patterns (~410 lines)
├── lidl.py              # Multi-rate TVA (A/B/C/D)
├── omv.py               # B2B, YYYY.MM.DD dates
├── socar.py             # B2B, YYYY.MM.DD dates
├── brick.py             # Standard TVA
├── dedeman.py           # E-factura support
├── kineterra.py         # Non-VAT payer
├── gama_ink.py          # Standard TVA (toner/cartridges)
├── electrobering.py     # Standard TVA (electronics)
├── pictus_velum.py      # Standard TVA (stationery)
├── unlimited_keys.py    # Standard TVA, NUMERAR payment
├── best_print.py        # Non-VAT payer ("neplătitor TVA")
├── stepout_market.py    # TVA 5% (books/bookstore)
└── README.md            # This file
```

---

## Existing profiles (12 profiles)

> **Note**: The TVA patterns are **flexible** and accept ANY rate (5%, 9%, 11%, 19%, 21%, etc.)
> to handle both historical data and future changes in legislation.

| Store | CUI | File | Characteristics |
|-------|-----|------|-----------------|
| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (codes A, B, C, D) |
| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), YYYY.MM.DD dates |
| SOCAR PETROLEUM S.A. | 12546600 | `socar.py` | B2B (client CUI), YYYY.MM.DD dates |
| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA |
| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support |
| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returns `[]`) |
| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartridges) |
| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronics) |
| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (stationery) |
| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** (cash) payment |
| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** ("neplătitor TVA") |
| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (books, bookstore) |
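
As a rough illustration of the flexible-rate note in this section, the sketch below captures whatever one- or two-digit rate appears instead of hard-coding 19% or 21%. The receipt line layout (`TVA 21% 4,20`), the `TVA_LINE` regex, and the returned dict shape are assumptions made for the example; the real patterns live in the profile files.

```python
import re
from decimal import Decimal
from typing import List

# Hypothetical line layout: "TVA 21% 4,20". The rate group accepts any 1-2 digit value.
TVA_LINE = re.compile(r"TVA\s+(\d{1,2})%\s+(\d+[.,]\d{2})")


def extract_tva_entries(text: str) -> List[dict]:
    """Return one entry per distinct (rate, amount) pair found in the text."""
    seen = set()
    entries = []
    for match in TVA_LINE.finditer(text.upper()):
        rate = int(match.group(1))
        amount = Decimal(match.group(2).replace(",", "."))
        key = (rate, amount)
        if key in seen:  # the same line read twice by OCR
            continue
        seen.add(key)
        entries.append({"rate": rate, "amount": amount})
    return entries


entries = extract_tva_entries("TVA 21% 4,20\nTVA 11% 1,10\nTVA 21% 4,20")
```

Because the rate is captured rather than matched literally, a receipt printed under an older 19% rate and one printed under 21% go through the same pattern.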

---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/data-entry/ocr/profiles` | GET | List all profiles |
| `/api/data-entry/ocr/profiles/{cui}` | GET | Profile details (accepts the RO prefix) |
| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload all profiles |

### API examples

```bash
# List profiles
curl http://localhost:8000/api/data-entry/ocr/profiles \
  -H "Authorization: Bearer <token>"

# Profile details (with or without the RO prefix)
curl http://localhost:8000/api/data-entry/ocr/profiles/22891860 \
  -H "Authorization: Bearer <token>"
curl http://localhost:8000/api/data-entry/ocr/profiles/RO22891860 \
  -H "Authorization: Bearer <token>"

# Hot-reload after changes
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
  -H "Authorization: Bearer <token>"

# Reload response:
{
  "success": true,
  "reloaded_modules": 12,
  "profiles_count": 12,
  "registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...],
  "last_reload": "2026-01-06T22:37:05.000000"
}
```

---

## How the system works

### Extraction flow

```
ReceiptExtractor.extract()
    │
    ├─► STEP 1: Extract vendor + CUI
    │       └─► _extract_vendor(), _extract_cui()
    │
    ├─► ProfileRegistry.get_profile(cui)
    │       └─► Returns the specific profile or None
    │
    ├─► STEP 2: Extraction with the profile (if one exists)
    │       ├─► profile.extract_total()
    │       ├─► profile.extract_date()
    │       ├─► profile.extract_receipt_number()
    │       ├─► profile.extract_tva_entries()
    │       ├─► profile.extract_payment_methods()
    │       └─► profile.extract_client_cui()
    │
    └─► STEP 3-4: Validation + post-processing
```

### Fallback

If no profile exists for the CUI, the generic logic in `ReceiptExtractor` is used.
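
The profile-or-fallback decision can be sketched as follows. This is a simplified model rather than the code in `ReceiptExtractor`: `DemoProfile`, `generic_extract_total`, and the plain dict used as a registry are hypothetical stand-ins, and the string results exist only to show which path was taken.

```python
from typing import Callable, Dict, Optional


class DemoProfile:
    """Hypothetical profile exposing one of the extract_* methods."""

    def extract_total(self, text: str) -> str:
        return "profile:" + text.strip()


def generic_extract_total(text: str) -> str:
    """Stand-in for the generic, store-agnostic extraction logic."""
    return "generic:" + text.strip()


def extract_total(
    cui: Optional[str],
    text: str,
    profiles: Dict[str, DemoProfile],
    fallback: Callable[[str], str] = generic_extract_total,
) -> str:
    # A missing CUI or an unknown CUI both fall through to the generic path.
    profile = profiles.get(cui) if cui else None
    if profile is not None:
        return profile.extract_total(text)
    return fallback(text)


profiles = {"22891860": DemoProfile()}
```

Keeping the fallback as the default path means an unrecognized store still produces a result instead of an error.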

---

## Structure of a profile

```python
from typing import Any, Dict, List

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class MagazinNouProfile(BaseStoreProfile):
    """Docstring describing the store."""

    CUI_LIST = ["12345678"]  # May contain several CUIs
    NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"]  # OCR variants
    STORE_NAME = "Magazin Nou SRL"

    # Override only what differs from the base class
    def extract_tva_entries(self, text: str) -> List[dict]:
        # Store-specific patterns
        ...

    def get_validation_hints(self) -> Dict[str, Any]:
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
```

---

## Patterns available in base.py

BaseStoreProfile includes generic, OCR-tolerant patterns:

| Pattern | Description |
|---------|-------------|
| `TOTAL_PATTERNS` | 8 variants for TOTAL (TOTAL:, TOTAL DE PLATA, etc.) |
| `DATE_PATTERNS` | 6 variants (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) |
| `DATE_PATTERNS_OCR_SPACES` | 4 variants with OCR-inserted spaces ("2025. 08. 14") |
| `NUMBER_PATTERNS` | 11 variants for the receipt number (NDS, BF, C3POS) |
| `PAYMENT_PATTERNS` | 8 variants for CARD/NUMERAR |
| `CLIENT_MARKERS` | 6 variants for the CLIENT section |
| `CLIENT_CUI_PATTERNS` | 7 variants for the client CUI |

### Methods implemented in the base class

- `extract_total(text)` → `Tuple[Decimal, float]`
- `extract_date(text)` → `Tuple[date, float]`
- `extract_receipt_number(text)` → `Tuple[str, float]`
- `extract_payment_methods(text)` → `List[dict]`
- `extract_client_cui(text)` → `Tuple[str, float]`
- `extract_client_name(text)` → `Tuple[str, float]`
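
The `Tuple[value, float]` return shape listed above can be illustrated with a short sketch. It assumes the float is a confidence score and that patterns are tried from most to least specific; the two regexes and the confidence values are invented for the example and are not the eight `TOTAL_PATTERNS` from `base.py`.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple

# Hypothetical (pattern, confidence) pairs, most specific first.
TOTAL_PATTERNS = [
    (re.compile(r"TOTAL\s+DE\s+PLATA\s*:?\s*(\d+[.,]\d{2})"), 0.95),
    (re.compile(r"TOTAL\s*:?\s*(\d+[.,]\d{2})"), 0.85),
]


def extract_total(text: str) -> Tuple[Optional[Decimal], float]:
    """Return (amount, confidence), or (None, 0.0) when nothing matches."""
    upper = text.upper()
    for pattern, confidence in TOTAL_PATTERNS:
        match = pattern.search(upper)
        if match:
            # Receipts use a decimal comma; Decimal needs a dot.
            return Decimal(match.group(1).replace(",", ".")), confidence
    return None, 0.0
```

Returning the confidence alongside the value lets the caller compare candidates or flag low-confidence fields for review.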

---

## When do you need a custom profile?

| Situation | Example | What to override |
|-----------|---------|------------------|
| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` |
| **Special date format** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` |
| **B2B receipts** | Fuel stations (they print a client CUI) | `extract_client_cui()` |
| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returns `[]` |
| **E-factura** | Dedeman | `extract_efactura_reference()` |
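
For the non-VAT payer row, a custom profile can be as small as the sketch below. The `BaseStoreProfile` here is a minimal stand-in defined only so the example runs on its own, and `NonVatPayerProfile` with its placeholder CUI is hypothetical; the real base class has many more methods.

```python
from typing import Any, Dict, List


class BaseStoreProfile:
    """Minimal stand-in for the real base class, for illustration only."""

    CUI_LIST: List[str] = []

    def extract_tva_entries(self, text: str) -> List[dict]:
        raise NotImplementedError

    def get_validation_hints(self) -> Dict[str, Any]:
        return {"is_non_vat_payer": False}


class NonVatPayerProfile(BaseStoreProfile):
    """Hypothetical profile for a store that does not charge TVA."""

    CUI_LIST = ["00000000"]  # placeholder CUI, not a real store

    def extract_tva_entries(self, text: str) -> List[dict]:
        # Any "TVA" text on the receipt is ignored on purpose.
        return []

    def get_validation_hints(self) -> Dict[str, Any]:
        hints = super().get_validation_hints()
        hints["is_non_vat_payer"] = True
        return hints


profile = NonVatPayerProfile()
```

Setting the hint alongside the empty list gives later validation a way to tell "no TVA expected" apart from "TVA extraction failed".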

---

## Design decisions

1. **Manual hot-reload** - the `/profiles/reload` endpoint is called when files change
2. **Persistence in Python** - profiles live in Git and are version controlled
3. **Graceful fallback** - if no profile exists, the generic logic is used
4. **CUI normalization** - the "RO" prefix and whitespace are handled automatically
5. **TVA deduplication** - uses `seen = set()` to avoid duplicates
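
The CUI normalization in item 4 might look like the sketch below. The body of the real `_normalize_cui` is not shown here, so this `normalize_cui` is an assumption about its behavior: strip all whitespace, ignore case, and drop a leading "RO".

```python
def normalize_cui(cui: str) -> str:
    """Normalize a CUI so 'RO22891860', ' ro 22891860 ' and '22891860' all match."""
    # Remove every whitespace character, including ones inside the value.
    cleaned = "".join(cui.split()).upper()
    if cleaned.startswith("RO"):
        cleaned = cleaned[2:]
    return cleaned


lookup_key = normalize_cui(" RO 22891860 ")
```

Normalizing at both registration and lookup time is what lets `/profiles/22891860` and `/profiles/RO22891860` resolve to the same profile.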

---

## Useful commands

```bash
# Check Python syntax for all profiles
for f in backend/modules/data_entry/services/ocr/profiles/*.py; do
  python3 -m py_compile "$f" && echo "✓ $(basename "$f")"
done

# List profiles
ls -la backend/modules/data_entry/services/ocr/profiles/

# Start the backend for testing
cd backend && source venv/bin/activate
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

# Test OCR on a PDF
curl -X POST -F "file=@docs/data-entry/test.pdf" \
  -H "Authorization: Bearer <token>" \
  "http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus"
```

---

## Profile generation script

`scripts/generate_store_profile.py` - automatic profile generator

```bash
# Show help
python scripts/generate_store_profile.py --help

# Features:
# - Analyzes PDFs via the OCR API
# - Detects: TVA format, date format, payment patterns, B2B
# - Generates Python code with OCR error variants
# - Supports glob patterns (*.pdf)
# - Checks syntax after generation
```
388
backend/modules/data_entry/services/ocr/profiles/__init__.py
Normal file
@@ -0,0 +1,388 @@
"""
Store Profiles Registry with Hot-Reload Support.

This module provides a registry for store-specific OCR extraction profiles.
Profiles can be reloaded at runtime without restarting the server.

Usage:
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    # Get profile for a CUI
    profile = ProfileRegistry.get_profile("22891860")
    if profile:
        tva_entries = profile.extract_tva_entries(text)

    # Reload all profiles (after file changes)
    count = ProfileRegistry.reload_all()

Architecture:
    - ProfileRegistry: Singleton registry with class methods
    - BaseStoreProfile: Abstract base class for profiles
    - @ProfileRegistry.register: Decorator for profile classes

Hot-Reload Mechanism:
    1. Admin calls POST /profiles/reload endpoint
    2. Registry clears instance cache
    3. importlib.reload() re-executes each profile module
    4. @register decorator re-registers classes with new code
"""

from __future__ import annotations

import importlib
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Type, TYPE_CHECKING

if TYPE_CHECKING:
    from .base import BaseStoreProfile

logger = logging.getLogger(__name__)

# Directory containing profile modules
PROFILES_DIR = Path(__file__).parent


class ProfileRegistry:
    """
    Registry for store-specific OCR extraction profiles.

    Uses class methods for singleton-like behavior without explicit instantiation.
    Supports hot-reload via importlib.reload() for runtime updates.

    Attributes:
        _profiles: Maps CUI -> profile class (not instance)
        _instances: Maps CUI -> profile instance (lazy, cleared on reload)
        _last_reload: Timestamp of last reload
        _loaded: Whether initial load has been performed
    """

    # Class-level storage (singleton pattern via class methods)
    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
    _instances: Dict[str, "BaseStoreProfile"] = {}
    _last_reload: Optional[datetime] = None
    _loaded: bool = False

    # -------------------------------------------------------------------------
    # Registration
    # -------------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]:
|
||||
"""
|
||||
Decorator to register a store profile class.
|
||||
|
||||
Registers the profile for all CUIs in the class's CUI_LIST.
|
||||
Safe for re-registration during hot-reload (overwrites existing).
|
||||
|
||||
Usage:
|
||||
@ProfileRegistry.register
|
||||
class LidlProfile(BaseStoreProfile):
|
||||
CUI_LIST = ["22891860"]
|
||||
...
|
||||
|
||||
Args:
|
||||
profile_class: Profile class to register
|
||||
|
||||
Returns:
|
||||
The same class (allows use as decorator)
|
||||
|
||||
Note:
A class with an empty CUI_LIST is not registered; a warning is logged and the class is returned unchanged.
|
||||
"""
|
||||
cui_list = getattr(profile_class, 'CUI_LIST', [])
|
||||
store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__)
|
||||
|
||||
if not cui_list:
|
||||
logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping")
|
||||
return profile_class
|
||||
|
||||
# Register for each CUI
|
||||
for cui in cui_list:
|
||||
# Normalize CUI (remove RO prefix, strip whitespace)
|
||||
normalized_cui = cls._normalize_cui(cui)
|
||||
|
||||
if normalized_cui in cls._profiles:
|
||||
old_class = cls._profiles[normalized_cui]
|
||||
logger.debug(
|
||||
f"Re-registering CUI {normalized_cui}: "
|
||||
f"{old_class.__name__} -> {profile_class.__name__}"
|
||||
)
|
||||
# Clear cached instance for this CUI
|
||||
cls._instances.pop(normalized_cui, None)
|
||||
|
||||
cls._profiles[normalized_cui] = profile_class
|
||||
logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}")
|
||||
|
||||
logger.info(f"Registered {store_name} for CUIs: {cui_list}")
|
||||
return profile_class
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Lookup
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]:
|
||||
"""
|
||||
Get profile instance for a CUI.
|
||||
|
||||
Uses lazy instantiation - creates instance on first access.
|
||||
Returns None if no profile is registered for this CUI.
|
||||
|
||||
Args:
|
||||
cui: CUI to lookup (with or without RO prefix)
|
||||
|
||||
Returns:
|
||||
Profile instance or None
|
||||
"""
|
||||
if not cui:
|
||||
return None
|
||||
|
||||
# Ensure profiles are loaded
|
||||
if not cls._loaded:
|
||||
cls._load_all_profiles()
|
||||
|
||||
normalized_cui = cls._normalize_cui(cui)
|
||||
|
||||
# Check if profile exists
|
||||
profile_class = cls._profiles.get(normalized_cui)
|
||||
if not profile_class:
|
||||
return None
|
||||
|
||||
# Lazy instantiation
|
||||
if normalized_cui not in cls._instances:
|
||||
try:
|
||||
cls._instances[normalized_cui] = profile_class()
|
||||
logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to instantiate {profile_class.__name__}: {e}")
|
||||
return None
|
||||
|
||||
return cls._instances[normalized_cui]
|
||||
|
||||
@classmethod
|
||||
def has_profile(cls, cui: Optional[str]) -> bool:
|
||||
"""Check if a profile exists for this CUI."""
|
||||
if not cui:
|
||||
return False
|
||||
if not cls._loaded:
|
||||
cls._load_all_profiles()
|
||||
return cls._normalize_cui(cui) in cls._profiles
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Listing
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def list_profiles(cls) -> List[Dict]:
|
||||
"""
|
||||
List all registered profiles.
|
||||
|
||||
Returns:
|
||||
List of dicts with cuis, class_name, store_name, name_patterns (one entry per profile class)
|
||||
"""
|
||||
if not cls._loaded:
|
||||
cls._load_all_profiles()
|
||||
|
||||
result = []
|
||||
seen_classes = set()
|
||||
|
||||
for cui, profile_class in cls._profiles.items():
|
||||
# Avoid duplicates for profiles with multiple CUIs
|
||||
if profile_class.__name__ in seen_classes:
|
||||
continue
|
||||
seen_classes.add(profile_class.__name__)
|
||||
|
||||
result.append({
|
||||
"cuis": list(getattr(profile_class, 'CUI_LIST', [])),
|
||||
"class_name": profile_class.__name__,
|
||||
"store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__),
|
||||
"name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])),
|
||||
})
|
||||
|
||||
return result
|
||||
|
||||
@classmethod
|
||||
def get_profile_info(cls, cui: str) -> Optional[Dict]:
|
||||
"""
|
||||
Get detailed info about a profile.
|
||||
|
||||
Args:
|
||||
cui: CUI to lookup
|
||||
|
||||
Returns:
|
||||
Dict with profile details or None
|
||||
"""
|
||||
profile = cls.get_profile(cui)
|
||||
if not profile:
|
||||
return None
|
||||
|
||||
return {
|
||||
"cui": cui,
|
||||
"cuis": list(profile.CUI_LIST),
|
||||
"class_name": profile.__class__.__name__,
|
||||
"store_name": profile.STORE_NAME,
|
||||
"name_patterns": list(profile.NAME_PATTERNS),
|
||||
"validation_hints": profile.get_validation_hints(),
|
||||
}
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Hot-Reload
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def reload_all(cls) -> int:
|
||||
"""
|
||||
Hot-reload all profile modules.
|
||||
|
||||
Clears instance cache and reloads all .py files in profiles directory.
|
||||
Decorator re-registers classes with updated code.
|
||||
|
||||
Returns:
Number of modules reloaded or newly imported; modules that fail to load are logged and not counted
|
||||
"""
|
||||
logger.info("Starting profile hot-reload...")
|
||||
|
||||
# Clear instance cache (will be recreated on next get_profile)
|
||||
cls._instances.clear()
|
||||
|
||||
# Get list of profile modules (exclude __init__, base)
|
||||
module_names = cls._get_profile_module_names()
|
||||
|
||||
count = 0
|
||||
for module_name in module_names:
|
||||
full_name = f"{__name__}.{module_name}"  # resolve relative to this package instead of a hard-coded path
|
||||
|
||||
try:
|
||||
if full_name in sys.modules:
|
||||
# Reload existing module
|
||||
importlib.reload(sys.modules[full_name])
|
||||
logger.debug(f"Reloaded module: {module_name}")
|
||||
else:
|
||||
# Import new module
|
||||
importlib.import_module(full_name)
|
||||
logger.debug(f"Imported new module: {module_name}")
|
||||
count += 1
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to reload {module_name}: {e}")
|
||||
|
||||
cls._last_reload = datetime.utcnow()
|
||||
cls._loaded = True
|
||||
|
||||
logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles")
|
||||
return count
|
||||
|
||||
@classmethod
|
||||
def get_reload_status(cls) -> Dict:
|
||||
"""Get status of the registry including last reload time."""
|
||||
return {
|
||||
"loaded": cls._loaded,
|
||||
"last_reload": cls._last_reload.isoformat() if cls._last_reload else None,
|
||||
"profiles_count": len(cls._profiles),
|
||||
"instances_count": len(cls._instances),
|
||||
"registered_cuis": list(cls._profiles.keys()),
|
||||
}
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Internal methods
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
@classmethod
|
||||
def _normalize_cui(cls, cui: str) -> str:
|
||||
"""
|
||||
Normalize CUI for consistent lookup.
|
||||
|
||||
- Removes RO prefix (with or without space)
|
||||
- Strips whitespace
|
||||
- Converts to uppercase
|
||||
|
||||
Args:
|
||||
cui: Raw CUI string
|
||||
|
||||
Returns:
|
||||
Normalized CUI (uppercased, with surrounding whitespace and the RO prefix removed; other non-digit characters are not stripped)
|
||||
"""
|
||||
if not cui:
|
||||
return ""
|
||||
|
||||
cui = str(cui).strip().upper()
|
||||
|
||||
# Remove RO prefix (handles "RO12345" and "RO 12345")
|
||||
if cui.startswith("RO"):
|
||||
cui = cui[2:].lstrip()
|
||||
|
||||
return cui.strip()
|
||||
|
||||
@classmethod
|
||||
def _get_profile_module_names(cls) -> List[str]:
|
||||
"""
|
||||
Get list of profile module names from profiles directory.
|
||||
|
||||
Excludes __init__.py and base.py.
|
||||
|
||||
Returns:
|
||||
List of module names (without .py extension)
|
||||
"""
|
||||
excluded = {"__init__", "base", "__pycache__"}
|
||||
modules = []
|
||||
|
||||
for path in PROFILES_DIR.glob("*.py"):
|
||||
name = path.stem
|
||||
if name not in excluded:
|
||||
modules.append(name)
|
||||
|
||||
return sorted(modules)
|
||||
|
||||
@classmethod
|
||||
def _load_all_profiles(cls) -> None:
|
||||
"""
|
||||
Initial load of all profile modules.
|
||||
|
||||
Called automatically by the first get_profile(), has_profile(), or list_profiles() call if profiles are not already loaded.
|
||||
"""
|
||||
if cls._loaded:
|
||||
return
|
||||
|
||||
logger.info("Loading store profiles...")
|
||||
|
||||
module_names = cls._get_profile_module_names()
|
||||
|
||||
for module_name in module_names:
|
||||
full_name = f"{__name__}.{module_name}"  # resolve relative to this package instead of a hard-coded path
|
||||
try:
|
||||
importlib.import_module(full_name)
|
||||
logger.debug(f"Loaded module: {module_name}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to load {module_name}: {e}")
|
||||
|
||||
cls._loaded = True
|
||||
cls._last_reload = datetime.utcnow()
|
||||
|
||||
logger.info(f"Loaded {len(cls._profiles)} store profiles")
|
||||
|
||||
@classmethod
|
||||
def clear(cls) -> None:
|
||||
"""
|
||||
Clear all registered profiles.
|
||||
|
||||
Mainly useful for testing.
|
||||
"""
|
||||
cls._profiles.clear()
|
||||
cls._instances.clear()
|
||||
cls._loaded = False
|
||||
cls._last_reload = None
|
||||
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Module exports
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
__all__ = [
|
||||
"ProfileRegistry",
|
||||
"BaseStoreProfile",
|
||||
]
|
||||
|
||||
# Re-export BaseStoreProfile for convenience
|
||||
from .base import BaseStoreProfile
|
||||
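The registration and hot-reload flow in `__init__.py` can be tried in isolation. The sketch below is a minimal illustration under stated assumptions, not the project's code: `MiniRegistry`, `mini_registry_host`, `demo_profile_mod`, and `DemoProfile` are hypothetical names, and the profile module is written to a temporary directory so that `importlib.reload()` has a real file to re-execute.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Demo-only setting: skip .pyc files so the edited source is always recompiled.
sys.dont_write_bytecode = True


def normalize_cui(cui):
    """Uppercase, trim, and drop a leading RO prefix (with or without a space)."""
    cui = cui.strip().upper()
    if cui.startswith("RO"):
        cui = cui[2:].lstrip()
    return cui


class MiniRegistry:
    """Hypothetical stand-in for ProfileRegistry: normalized CUI -> class."""

    profiles = {}
    instances = {}

    @classmethod
    def register(cls, profile_class):
        for cui in profile_class.CUI_LIST:
            key = normalize_cui(cui)
            cls.profiles[key] = profile_class
            cls.instances.pop(key, None)  # drop any instance built from old code
        return profile_class

    @classmethod
    def get(cls, cui):
        key = normalize_cui(cui)
        if key not in cls.profiles:
            return None
        if key not in cls.instances:
            cls.instances[key] = cls.profiles[key]()  # lazy instantiation
        return cls.instances[key]


# Expose the registry under an importable name so the temp module can use it.
host = types.ModuleType("mini_registry_host")
host.MiniRegistry = MiniRegistry
sys.modules["mini_registry_host"] = host

SOURCE_V1 = '''
from mini_registry_host import MiniRegistry

@MiniRegistry.register
class DemoProfile:
    CUI_LIST = ["RO 12345678"]

    def label(self):
        return "v1"
'''
SOURCE_V2 = SOURCE_V1.replace('"v1"', '"version 2"')


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "demo_profile_mod.py"
        module_path.write_text(SOURCE_V1)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("demo_profile_mod")
            before = MiniRegistry.get("12345678").label()
            module_path.write_text(SOURCE_V2)  # simulate editing the profile
            importlib.reload(module)           # decorator runs again
            after = MiniRegistry.get("ro12345678").label()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_profile_mod", None)
    return before, after


before, after = run_demo()
```

Because the decorator runs again when the module body is re-executed, the registry ends up pointing at the new class. Dropping the cached instance ensures the next lookup builds an object from the updated code.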
515
backend/modules/data_entry/services/ocr/profiles/base.py
Normal file
@@ -0,0 +1,515 @@
|
||||
"""
|
||||
Base class for store-specific OCR extraction profiles.
|
||||
|
||||
Each store can have different receipt formats (TVA layout, total position, etc.).
|
||||
Store profiles allow customizing extraction logic per-store for better accuracy.
|
||||
|
||||
Usage:
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
@ProfileRegistry.register
|
||||
class LidlProfile(BaseStoreProfile):
|
||||
CUI_LIST = ["22891860"]
|
||||
NAME_PATTERNS = ["LIDL", "LDL"]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
# Custom Lidl TVA extraction logic
|
||||
...
|
||||
"""
|
||||
|
||||
import re
|
||||
from abc import ABC
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Optional, Tuple, Dict, Any
|
||||
from datetime import date
|
||||
|
||||
|
||||
class BaseStoreProfile(ABC):
|
||||
"""
|
||||
Abstract base class for store-specific extraction profiles.
|
||||
|
||||
Each profile defines:
|
||||
- CUI_LIST: CUI codes that identify this store (without RO prefix)
|
||||
- NAME_PATTERNS: OCR-tolerant name patterns for fallback matching
|
||||
- Custom extraction methods for TVA, total, date, etc.
|
||||
|
||||
The ProfileRegistry uses CUI_LIST to lookup profiles during extraction.
|
||||
"""
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Class attributes - override in subclasses
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
# List of CUI codes (without RO prefix) that identify this store
|
||||
CUI_LIST: List[str] = []
|
||||
|
||||
# OCR-tolerant name patterns for fallback matching
|
||||
NAME_PATTERNS: List[str] = []
|
||||
|
||||
# Store display name
|
||||
STORE_NAME: str = "Unknown Store"
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Generic patterns - can be overridden in subclasses
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
# Total amount patterns (confidence-weighted)
|
||||
TOTAL_PATTERNS = [
|
||||
(r'T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98),
|
||||
(r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
|
||||
(r'[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95),
|
||||
(r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
|
||||
(r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
|
||||
(r'SUBTOTAL\s*([\d\s.,]+)', 0.90),
|
||||
(r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
|
||||
(r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
|
||||
]
|
||||
|
||||
# Date patterns (confidence-weighted)
|
||||
DATE_PATTERNS = [
|
||||
(r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
|
||||
(r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
|
||||
(r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95),
|
||||
(r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90),
|
||||
(r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80),
|
||||
(r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75),
|
||||
]
|
||||
|
||||
# Date patterns with OCR-introduced spaces (separate because format is different)
|
||||
DATE_PATTERNS_OCR_SPACES = [
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'),
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
|
||||
]
|
||||
|
||||
# Receipt number patterns (confidence-weighted)
|
||||
NUMBER_PATTERNS = [
|
||||
(r'NDS\s*:?\s*(\d+)', 0.98),
|
||||
(r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98),
|
||||
(r'C3POS.*?(\d{6,7})\b', 0.95),
|
||||
(r'BF\s*:\s*(\d{4,})', 0.96),
|
||||
(r'BF\s+(\d{4,})', 0.93),
|
||||
(r'NIVS\s*:?\s*(\d+)', 0.95),
|
||||
(r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
|
||||
(r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
|
||||
(r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95),
|
||||
(r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90),
|
||||
(r'ID\s*BF\s*:?\s*(\d+)', 0.90),
|
||||
]
|
||||
|
||||
# Payment method patterns (pattern, method_type, confidence)
|
||||
PAYMENT_PATTERNS = [
|
||||
(r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98),
|
||||
(r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97),
|
||||
(r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95),
|
||||
(r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95),
|
||||
(r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90),
|
||||
(r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70),
|
||||
(r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75),
|
||||
(r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70),
|
||||
]
|
||||
|
||||
# Client section markers (for B2B receipts)
|
||||
CLIENT_MARKERS = [
|
||||
r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:',
|
||||
r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:',
|
||||
r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:',
|
||||
r'CLIENT\s*:',
|
||||
r'CUMPARATOR\s*:',
|
||||
r'BENEFICIAR\s*:',
|
||||
]
|
||||
|
||||
# Client CUI patterns (pattern, confidence)
|
||||
CLIENT_CUI_PATTERNS = [
|
||||
(r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99),
|
||||
(r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98),
|
||||
(r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98),
|
||||
(r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
|
||||
(r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
|
||||
(r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95),
|
||||
(r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90),
|
||||
]
|
||||
|
||||
# Company type indicators (for identifying company names)
|
||||
COMPANY_INDICATORS = [
|
||||
r'\bS\.?\s*R\.?\s*L\.?\b', # S.R.L. or S. R. L.
|
||||
r'\bS\.?\s*A\.?\b', # S.A. or S. A.
|
||||
r'\bS\.?\s*N\.?\s*C\.?\b', # S.N.C. or S. N. C.
|
||||
r'\bS\.?\s*C\.?\s*S\.?\b', # S.C.S. or S. C. S.
|
||||
r'\bI\.?\s*I\.?\b', # I.I. or I. I.
|
||||
r'\bP\.?\s*F\.?\s*A\.?\b', # P.F.A. or P. F. A.
|
||||
r'\bS\.?\s*C\.?\s+[A-Z]', # S.C. followed by company name
|
||||
r'HOLDING',
|
||||
r'COMPANY',
|
||||
r'GROUP',
|
||||
]
|
||||
|
||||
# Maximum reasonable payment amount (to filter OCR errors)
|
||||
MAX_PAYMENT = Decimal('100000')
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Extraction methods - override in subclasses as needed
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Override this method in subclasses to handle store-specific TVA formats.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of dicts with keys: code, percent, amount
|
||||
"""
|
||||
return []
|
||||
|
||||
def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]:
|
||||
"""
|
||||
Extract total amount from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (amount, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
|
||||
for pattern, confidence in self.TOTAL_PATTERNS:
|
||||
match = re.search(pattern, text_upper)
|
||||
if match:
|
||||
amount = self._parse_decimal(match.group(1))
|
||||
if amount and amount > 0 and amount < self.MAX_PAYMENT:
|
||||
return (amount, confidence)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
def extract_date(self, text: str) -> Tuple[Optional[date], float]:
|
||||
"""
|
||||
Extract receipt date from text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (date, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
|
||||
# Try standard patterns first
|
||||
for pattern, confidence in self.DATE_PATTERNS:
|
||||
match = re.search(pattern, text_upper)
|
||||
if match:
|
||||
parsed = self._parse_date(match.group(1))
|
||||
if parsed:
|
||||
return (parsed, confidence)
|
||||
|
||||
# Try OCR-corrupted patterns with spaces
|
||||
for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES:
|
||||
match = re.search(pattern, text_upper)
|
||||
if match:
|
||||
try:
|
||||
if fmt == 'ymd':
|
||||
year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
|
||||
else: # dmy
|
||||
day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))
|
||||
|
||||
if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
|
||||
return (date(year, month, day), confidence)
|
||||
except (ValueError, TypeError):
|
||||
continue
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract receipt number from text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (number, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
|
||||
for pattern, confidence in self.NUMBER_PATTERNS:
|
||||
match = re.search(pattern, text_upper)
|
||||
if match:
|
||||
number = match.group(1).strip()
|
||||
if number and len(number) >= 3:
|
||||
return (number, confidence)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
def extract_payment_methods(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract payment methods (CARD/NUMERAR) from receipt.
|
||||
|
||||
Supports multiple payments of the same type (e.g., 2x CARD for split payments).
|
||||
Each payment is returned as a separate entry with its amount.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}]
|
||||
Multiple entries of same method type are allowed for split payments.
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
methods = []
|
||||
# Track (method, amount) pairs to avoid exact duplicates from overlapping patterns
|
||||
seen_entries = set()
|
||||
|
||||
for pattern, method, confidence in self.PAYMENT_PATTERNS:
|
||||
for match in re.finditer(pattern, text_upper):
|
||||
try:
|
||||
amount = self._parse_decimal(match.group(1))
|
||||
if amount and amount > 0 and amount < self.MAX_PAYMENT:
|
||||
# Deduplicate by (method, amount) to avoid same entry from multiple patterns
|
||||
# But allow different amounts for same method (split payments)
|
||||
entry_key = (method, amount)
|
||||
if entry_key not in seen_entries:
|
||||
methods.append({
|
||||
'method': method,
|
||||
'amount': amount,
|
||||
'confidence': confidence
|
||||
})
|
||||
seen_entries.add(entry_key)
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return methods
|
||||
|
||||
def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract client CUI from B2B receipts.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (cui, confidence) or (None, 0.0)
|
||||
"""
|
||||
text_upper = text.upper()
|
||||
|
||||
# First check if there's a CLIENT section
|
||||
has_client_section = any(
|
||||
re.search(marker, text_upper, re.IGNORECASE)
|
||||
for marker in self.CLIENT_MARKERS
|
||||
)
|
||||
|
||||
if not has_client_section:
|
||||
return (None, 0.0)
|
||||
|
||||
# Try to extract CUI
|
||||
for pattern, confidence in self.CLIENT_CUI_PATTERNS:
|
||||
match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE)
|
||||
if match:
|
||||
cui = match.group(1)
|
||||
# Normalize: strip the RO prefix first (including an OCR-misread "R0") so its zero is not kept as a digit
cui_digits = re.sub(r'[^0-9]', '', re.sub(r'^R[O0]?', '', cui))
|
||||
if 6 <= len(cui_digits) <= 10:
|
||||
return (cui_digits, confidence)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
def extract_client_name(self, text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract client/buyer company name from B2B receipts.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
Tuple of (client_name, confidence) or (None, 0.0)
|
||||
"""
|
||||
lines = text.split('\n')
|
||||
|
||||
# First check if there's a CLIENT section
|
||||
client_section_idx = None
|
||||
for i, line in enumerate(lines):
|
||||
line_upper = line.upper().strip()
|
||||
if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS):
|
||||
client_section_idx = i
|
||||
break
|
||||
|
||||
if client_section_idx is None:
|
||||
return (None, 0.0)
|
||||
|
||||
# Look for company name in CLIENT section
|
||||
line = lines[client_section_idx].strip()
|
||||
line_upper = line.upper()
|
||||
|
||||
# Strategy 1: Check if name is on same line after ":"
|
||||
if ':' in line:
|
||||
name_part = line.split(':', 1)[1].strip()
|
||||
if name_part and len(name_part) >= 3:
|
||||
# Skip if it looks like a CUI (RO followed by digits)
|
||||
if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()):
|
||||
pass # This is CUI, not name - continue to next strategy
|
||||
else:
|
||||
# Check for company indicators
|
||||
name_upper = name_part.upper()
|
||||
if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(name_part), 0.95)
|
||||
elif len(name_part) >= 5 and not name_part.isdigit():
|
||||
return (self._clean_company_name(name_part), 0.80)
|
||||
|
||||
# Strategy 2: Check next line for company name
|
||||
if client_section_idx + 1 < len(lines):
|
||||
next_line = lines[client_section_idx + 1].strip()
|
||||
next_upper = next_line.upper()
|
||||
|
||||
# Skip if it's a CUI/CIF line or looks like CUI
|
||||
if not re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', next_upper):
|
||||
if not re.match(r'^R[O0]?\d{6,10}$', next_upper):
|
||||
if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(next_line), 0.90)
|
||||
elif len(next_line) >= 5 and not next_line.isdigit():
|
||||
# Check it's not CUI/CIF/COD keywords
|
||||
if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']):
|
||||
return (self._clean_company_name(next_line), 0.75)
|
||||
|
||||
# Strategy 3: Look for any line with company indicators in CLIENT section region
|
||||
search_end = min(client_section_idx + 5, len(lines))
|
||||
for i in range(client_section_idx + 1, search_end):
|
||||
line = lines[i].strip()
|
||||
line_upper = line.upper()
|
||||
|
||||
# Skip CUI/CIF lines
|
||||
if re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', line_upper):
|
||||
continue
|
||||
if re.match(r'^R[O0]?\d{6,10}$', line_upper):
|
||||
continue
|
||||
|
||||
if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS):
|
||||
return (self._clean_company_name(line), 0.85)
|
||||
|
||||
return (None, 0.0)
|
||||
|
||||
@staticmethod
|
||||
def _clean_company_name(name: str) -> str:
|
||||
"""Clean company name for storage."""
|
||||
if not name:
|
||||
return ""
|
||||
# Remove extra whitespace
|
||||
name = re.sub(r'\s+', ' ', name).strip()
|
||||
# Remove trailing punctuation except periods in S.R.L., S.A., etc.
|
||||
name = re.sub(r'[,;:]+$', '', name).strip()
|
||||
return name
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Validation hints - override to customize validation behavior
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Return validation hints for this store.
|
||||
|
||||
Returns:
|
||||
Dict with validation hints. Common keys:
|
||||
- has_multi_rate_tva: bool - Store uses multiple TVA rates
|
||||
- card_equals_total: bool - CARD payment equals total
|
||||
- has_client_cui: bool - Receipt includes client CUI
|
||||
- has_efactura: bool - Store uses e-factura format
|
||||
- is_non_vat_payer: bool - Store is not a VAT payer
|
||||
"""
|
||||
return {}
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Helper methods - available to all subclasses
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
@staticmethod
|
||||
def _normalize_number(text: str) -> str:
|
||||
"""
|
||||
Normalize a number string for Decimal conversion.
|
||||
|
||||
Handles Romanian formats: "1.234,56" -> "1234.56"
|
||||
"""
|
||||
if not text:
|
||||
return "0"
|
||||
|
||||
# Remove spaces
|
||||
text = text.replace(" ", "")
|
||||
|
||||
# Determine decimal separator
|
||||
last_comma = text.rfind(",")
|
||||
last_dot = text.rfind(".")
|
||||
|
||||
if last_comma > last_dot:
|
||||
text = text.replace(".", "").replace(",", ".")
|
||||
elif last_dot > last_comma:
|
||||
text = text.replace(",", "")
|
||||
else:
|
||||
text = text.replace(",", ".")
|
||||
|
||||
return text
|
||||
|
||||
@staticmethod
|
||||
def _parse_decimal(text: str) -> Optional[Decimal]:
|
||||
"""Parse a string to Decimal, handling various formats."""
|
||||
try:
|
||||
normalized = BaseStoreProfile._normalize_number(text)
|
||||
return Decimal(normalized)
|
||||
except (InvalidOperation, ValueError, TypeError):
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def _parse_date(text: str) -> Optional[date]:
|
||||
"""
|
||||
Parse date string in various formats.
|
||||
|
||||
Supports: DD-MM-YYYY, DD/MM/YYYY, DD.MM.YYYY, YYYY-MM-DD
|
||||
"""
|
||||
if not text:
|
||||
return None
|
||||
|
||||
# Normalize separators
|
||||
text = text.replace('/', '-').replace('.', '-')
|
||||
|
||||
try:
|
||||
parts = text.split('-')
|
||||
if len(parts) != 3:
|
||||
return None
|
||||
|
||||
# Determine format based on first part length
|
||||
if len(parts[0]) == 4:
|
||||
# YYYY-MM-DD
|
||||
year, month, day = int(parts[0]), int(parts[1]), int(parts[2])
|
||||
else:
|
||||
# DD-MM-YYYY
|
||||
day, month, year = int(parts[0]), int(parts[1]), int(parts[2])
|
||||
|
||||
# Validate ranges
|
||||
if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
|
||||
return date(year, month, day)
|
||||
except (ValueError, TypeError, IndexError):
|
||||
pass
|
||||
|
||||
return None
|
||||
|
||||
@staticmethod
|
||||
def _clean_text(text: str) -> str:
|
||||
"""Clean OCR text for pattern matching."""
|
||||
if not text:
|
||||
return ""
|
||||
text = re.sub(r'\s+', ' ', text)
|
||||
text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text)
|
||||
return text.strip()
|
||||
|
||||
# -------------------------------------------------------------------------
|
||||
# Magic methods
|
||||
# -------------------------------------------------------------------------
|
||||
|
||||
def __repr__(self) -> str:
|
||||
return f"<{self.__class__.__name__} CUI={self.CUI_LIST}>"
|
||||
|
||||
def __str__(self) -> str:
|
||||
return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})"
|
||||
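The split-payment behaviour of `extract_payment_methods` (deduplicating on the `(method, amount)` pair rather than on the method alone) can be illustrated with a standalone sketch. The pattern list here is a trimmed, hypothetical subset chosen so that two patterns overlap; `normalize_number` and `extract_payments` are simplified illustrations of the helpers above, not the originals.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical, trimmed-down pattern list: (regex, method, confidence).
PAYMENT_PATTERNS = [
    (r'CARD\s*:?\s*([\d.,]+)', 'CARD', 0.95),
    (r'(?:PLATA\s+)?CARD\s*:?\s*([\d.,]+)', 'CARD', 0.90),  # overlaps the first
    (r'NUMERAR\s*:?\s*([\d.,]+)', 'NUMERAR', 0.95),
]


def normalize_number(text):
    """Convert Romanian-style '1.234,56' to '1234.56'."""
    text = text.replace(" ", "")
    last_comma, last_dot = text.rfind(","), text.rfind(".")
    if last_comma > last_dot:
        # Comma is the decimal separator; dots are thousands separators.
        return text.replace(".", "").replace(",", ".")
    # Dot is the decimal separator (or there is none); drop any commas.
    return text.replace(",", "")


def extract_payments(text):
    seen = set()       # (method, amount) pairs already emitted
    payments = []
    upper = text.upper()
    for pattern, method, confidence in PAYMENT_PATTERNS:
        for match in re.finditer(pattern, upper):
            try:
                amount = Decimal(normalize_number(match.group(1)))
            except InvalidOperation:
                continue
            if amount <= 0 or (method, amount) in seen:
                continue  # same payment found again by an overlapping pattern
            seen.add((method, amount))
            payments.append(
                {'method': method, 'amount': amount, 'confidence': confidence}
            )
    return payments


receipt = "CARD: 50,00\nCARD: 25,50\nNUMERAR 1.234,56"
payments = extract_payments(receipt)
```

With method-only deduplication the second CARD line would be dropped. Keying on the pair keeps both card payments while still suppressing the duplicates produced by the overlapping second pattern. One limitation of this approach is that two genuinely separate payments with the same method and the same amount would still be merged.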
@@ -0,0 +1,54 @@
|
||||
"""
|
||||
BEST PRINT TRADE ACTIV SRL store profile for OCR extraction.
|
||||
|
||||
Stamp manufacturing service. Non-VAT payer (neplătitor de TVA).
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class BestPrintProfile(BaseStoreProfile):
|
||||
"""
|
||||
BEST PRINT TRADE ACTIV SRL - non-VAT payer profile.
|
||||
|
||||
Key characteristics:
|
||||
- Non-VAT payer (neplătitor de TVA) - NO TVA on receipts
|
||||
- Stamp manufacturing and printing services
|
||||
- Total amount has no TVA component
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["45417955"]
|
||||
NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"]
|
||||
STORE_NAME = "BEST PRINT TRADE ACTIV SRL"
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries - returns empty for non-VAT payer.
|
||||
|
||||
BEST PRINT is a non-VAT payer (neplătitor de TVA),
|
||||
so no TVA entries are expected on receipts.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt (unused)
|
||||
|
||||
Returns:
|
||||
Empty list (non-VAT payer has no TVA)
|
||||
"""
|
||||
# Non-VAT payer - no TVA entries
|
||||
return []
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return BEST PRINT-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": True, # May have client CUI
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": True, # CRITICAL: Non-VAT payer
|
||||
"tva_pattern": "none",
|
||||
}
|
||||
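How the validation hints are consumed is not shown in this diff. As a rough sketch only, a caller might use `is_non_vat_payer` and `has_multi_rate_tva` along these lines; `check_tva` and its warning texts are hypothetical and exist purely to illustrate why a non-VAT profile returns `[]` from `extract_tva_entries`.

```python
def check_tva(entries, hints):
    """Hypothetical consumer of get_validation_hints(); returns warnings."""
    warnings = []
    if hints.get("is_non_vat_payer"):
        # For a non-VAT payer an empty TVA list is the expected result.
        if entries:
            warnings.append("non-VAT payer but TVA entries were extracted")
        return warnings
    if not entries:
        warnings.append("no TVA entries found for a VAT payer")
    rates = {entry["percent"] for entry in entries}
    if not hints.get("has_multi_rate_tva") and len(rates) > 1:
        warnings.append("multiple TVA rates on a single-rate store")
    return warnings


# Hint dictionaries shaped like the ones returned by the profiles above.
best_print_hints = {"is_non_vat_payer": True, "has_multi_rate_tva": False}
brick_hints = {"is_non_vat_payer": False, "has_multi_rate_tva": False}

ok_for_non_vat = check_tva([], best_print_hints)
missing_for_vat = check_tva([], brick_hints)
```

Under this assumed design, the same empty result is accepted for BEST PRINT and flagged for a VAT-paying store, which is the distinction the `is_non_vat_payer` hint makes possible.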
101
backend/modules/data_entry/services/ocr/profiles/brick.py
Normal file
@@ -0,0 +1,101 @@
|
||||
"""
|
||||
BRICK (Five-Holding) store profile for OCR extraction.
|
||||
|
||||
Five-Holding S.A. operates BRICK stores with standard receipt format.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class BrickProfile(BaseStoreProfile):
|
||||
"""
|
||||
FIVE-HOLDING S.A. (BRICK) - standard TVA format.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format
|
||||
- Single TVA rate typically
|
||||
- No client CUI on receipts
|
||||
"""
|
||||
|
||||
CUI_LIST = ["10562600"]
|
||||
NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"] # OCR variants
|
||||
STORE_NAME = "FIVE-HOLDING S.A."
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# Simple: "TVA XX% YY,YY"
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract BRICK-specific TVA entries.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return BRICK-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
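A minimal sketch of how the first coded pattern and the `(code, percent)` deduplication key above behave on a receipt that prints the same TVA line twice. The sample text is invented, and `parse_amount` is a hypothetical stand-in for `BaseStoreProfile._parse_decimal`, which lives in `base.py` and is not shown in this part of the diff; it assumes a decimal comma and no thousands separator.

```python
import re
from decimal import Decimal, InvalidOperation

# First coded pattern, copied from TVA_PATTERNS in the profile above.
CODED_PATTERN = r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)'


def parse_amount(raw):
    """Hypothetical stand-in for _parse_decimal: "12,34" -> Decimal("12.34")."""
    try:
        return Decimal(raw.replace(',', '.'))
    except InvalidOperation:
        return None


def extract_coded(text):
    """Collect one entry per (code, percent) pair, as the profile does."""
    entries, seen = [], set()
    for match in re.finditer(CODED_PATTERN, text, re.IGNORECASE):
        code = match.group(1).upper()
        percent = int(match.group(2))
        amount = parse_amount(match.group(3))
        if amount and amount > 0 and (code, percent) not in seen:
            entries.append({'code': code, 'percent': percent, 'amount': amount})
            seen.add((code, percent))
    return entries


# Invented receipt text: the TVA line appears twice (e.g. body and footer).
sample = "TVA A: 19% = 12,34\nTOTAL 77,29\nTVA A: 19% = 12,34"
result = extract_coded(sample)
print(result)
```

Because the key ignores the amount, two lines with the same code and rate but different amounts would also collapse into one entry; that is a property of the key as written, not something this sketch adds.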
118
backend/modules/data_entry/services/ocr/profiles/dedeman.py
Normal file
@@ -0,0 +1,118 @@
|
||||
"""
|
||||
DEDEMAN store profile for OCR extraction.
|
||||
|
||||
Dedeman receipts may include e-factura information and use standard TVA format.
|
||||
Large DIY retailer in Romania.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class DedemanProfile(BaseStoreProfile):
|
||||
"""
|
||||
DEDEMAN SRL - standard TVA with e-factura support.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format
|
||||
- May include e-factura reference number
|
||||
- Professional receipts for construction materials
|
||||
"""
|
||||
|
||||
CUI_LIST = ["2816464"]
|
||||
NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"] # OCR variants
|
||||
STORE_NAME = "DEDEMAN SRL"
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA (XX%) YY,YY"
|
||||
r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
# E-factura pattern for reference extraction
|
||||
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract Dedeman-specific TVA entries.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def extract_efactura_reference(self, text: str) -> str | None:
|
||||
"""
|
||||
Extract e-factura reference number if present.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
E-factura reference string or None
|
||||
"""
|
||||
match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE)
|
||||
return match.group(1) if match else None
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return Dedeman-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": True,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
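A short sketch of what `EFACTURA_PATTERN` captures, using invented receipt lines. It illustrates one property of the pattern as written: under `re.IGNORECASE` the class `[A-Z0-9]` also matches lowercase letters, so a word placed between "e-factura" and the number is captured instead of the number. Whether real Dedeman receipts contain such a word is not known from this diff.

```python
import re

# Pattern copied from DedemanProfile above.
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'


def efactura_reference(text):
    """Same logic as extract_efactura_reference, without the class wrapper."""
    match = re.search(EFACTURA_PATTERN, text, re.IGNORECASE)
    return match.group(1) if match else None


# Invented sample lines, for illustration only.
ref_ok = efactura_reference("E-FACTURA: 4512345678")      # number follows directly
ref_word = efactura_reference("e-factura nr 4512345678")  # a word comes first
ref_none = efactura_reference("BON FISCAL 0042")          # no e-factura marker

print(ref_ok, ref_word, ref_none)
```

In the second case the function returns `"nr"`, so a caller may want to check that the result contains at least one digit before trusting it.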
@@ -0,0 +1,102 @@
|
||||
"""
|
||||
ELECTROBERING S.R.L. store profile for OCR extraction.
|
||||
|
||||
Electronics and home supplies store.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class ElectroberingProfile(BaseStoreProfile):
|
||||
"""
|
||||
ELECTROBERING S.R.L. - standard TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (single rate, any percentage)
|
||||
- Electronics and home supplies
|
||||
- May have client CUI for B2B purchases
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["2744937"]
|
||||
NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"]
|
||||
STORE_NAME = "ELECTROBERING S.R.L."
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA XX% YY,YY" (simple format without code)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return ELECTROBERING-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": True, # May have client CUI for B2B
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
103
backend/modules/data_entry/services/ocr/profiles/gama_ink.py
Normal file
@@ -0,0 +1,103 @@
|
||||
"""
|
||||
GAMA INK SERVICE SRL store profile for OCR extraction.
|
||||
|
||||
Toner refill and printer supplies store.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class GamaInkProfile(BaseStoreProfile):
|
||||
"""
|
||||
GAMA INK SERVICE SRL - standard TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (single rate, any percentage)
|
||||
- Service-based (toner refill, printer supplies)
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["17741882"]
|
||||
NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"]
|
||||
STORE_NAME = "GAMA INK SERVICE SRL"
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA XX% YY,YY" (simple format without code)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
# "TVA: YY,YY" (amount only, no percent); not read by extract_tva_entries below, which uses only the first three patterns
|
||||
r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first (have both code and percent)
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format (percent + amount without code)
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return GAMA INK-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
@@ -0,0 +1,53 @@
|
||||
"""
|
||||
KINETERRA store profile for OCR extraction.
|
||||
|
||||
Kineterra is a non-VAT payer (neplătitor de TVA).
|
||||
Receipts don't include TVA breakdown.
|
||||
"""
|
||||
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class KineterraProfile(BaseStoreProfile):
|
||||
"""
|
||||
KINETERRA CONCEPT SRL - non-VAT payer profile.
|
||||
|
||||
Key characteristics:
|
||||
- Non-VAT payer (neplătitor de TVA)
|
||||
- No TVA breakdown on receipts
|
||||
- Total amount has no TVA component
|
||||
"""
|
||||
|
||||
CUI_LIST = ["31180432"]
|
||||
NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"] # OCR variants
|
||||
STORE_NAME = "KINETERRA CONCEPT SRL"
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries - returns empty for non-VAT payer.
|
||||
|
||||
Kineterra is a non-VAT payer, so no TVA entries are expected.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt (unused)
|
||||
|
||||
Returns:
|
||||
Empty list (non-VAT payer has no TVA)
|
||||
"""
|
||||
# Non-VAT payer - no TVA entries
|
||||
return []
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return Kineterra-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": True,
|
||||
"tva_pattern": "none",
|
||||
}
|
||||
93
backend/modules/data_entry/services/ocr/profiles/lidl.py
Normal file
@@ -0,0 +1,93 @@
|
||||
"""
|
||||
LIDL store profile for OCR extraction.
|
||||
|
||||
Lidl receipts have a specific TVA format without hyphen/colon separators:
|
||||
TOTAL TVA 9,84
|
||||
TVA A 21,00% 7,71
|
||||
TVA B 11,00% 2,13
|
||||
|
||||
This profile handles multi-rate TVA extraction for Lidl receipts.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class LidlProfile(BaseStoreProfile):
|
||||
"""
|
||||
LIDL DISCOUNT S.R.L. - multi-rate TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible)
|
||||
- TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line)
|
||||
- Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%)
|
||||
- CARD payment usually equals total
|
||||
- No client CUI on receipts
|
||||
"""
|
||||
|
||||
CUI_LIST = ["22891860"]
|
||||
NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"] # OCR variants
|
||||
STORE_NAME = "LIDL DISCOUNT S.R.L."
|
||||
|
||||
# Lidl-specific TVA patterns
|
||||
# Format: "TVA A 21,00% 7,71" (code + percent + amount on same line)
|
||||
TVA_PATTERNS = [
|
||||
# Primary: "TVA A 21,00% 7.71" with various spacing
|
||||
r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||
# With backslash OCR artifact: "TVA A \21,00% 7.71"
|
||||
r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||
# IVA variant (rare OCR misread)
|
||||
r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract Lidl-specific TVA entries.
|
||||
|
||||
Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts.
|
||||
Uses deduplication to avoid counting the same entry twice from different patterns.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set() # Deduplication key: (code, percent)
|
||||
|
||||
for pattern in self.TVA_PATTERNS:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return Lidl-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": True,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
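A sketch that runs the primary Lidl pattern against the three sample lines quoted in the module docstring. The `TOTAL TVA` regex and the cross-check of the per-rate sum against that total are assumptions added here for illustration; the profile itself does not perform this check. Amounts are converted with a plain comma-to-dot replacement rather than the real `_parse_decimal`.

```python
import re
from decimal import Decimal

# Primary pattern, copied from LidlProfile.TVA_PATTERNS above.
LIDL_PATTERN = r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'

# Sample lines taken from the lidl.py module docstring.
receipt = "TOTAL TVA 9,84\nTVA A 21,00% 7,71\nTVA B 11,00% 2,13"

entries = [
    (m.group(1).upper(), int(m.group(2)), Decimal(m.group(3).replace(',', '.')))
    for m in re.finditer(LIDL_PATTERN, receipt, re.IGNORECASE)
]

# Hypothetical cross-check: compare the per-rate sum with the printed total.
total_match = re.search(r'TOTAL\s+TVA\s+([\d.,]+)', receipt)
declared_total = Decimal(total_match.group(1).replace(',', '.'))
rates_sum = sum(amount for _, _, amount in entries)

print(entries, rates_sum == declared_total)
```

The `TOTAL TVA 9,84` line is not picked up as a rate entry because the pattern requires a code letter A-D right after the `TVA` token.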
99
backend/modules/data_entry/services/ocr/profiles/omv.py
Normal file
@@ -0,0 +1,99 @@
|
||||
"""
|
||||
OMV Petrom store profile for OCR extraction.
|
||||
|
||||
OMV receipts typically include client CUI and use standard TVA format.
|
||||
Common at gas stations with fuel purchases.
|
||||
|
||||
Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14")
|
||||
"""
|
||||
|
||||
import re
|
||||
from datetime import date
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any, Tuple, Optional
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class OMVProfile(BaseStoreProfile):
|
||||
"""
|
||||
OMV PETROM MARKETING S.R.L. - standard TVA with client CUI.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (usually single rate, any percentage)
|
||||
- Includes client CUI on receipt (for business purchases)
|
||||
- TVA table format: "A-XX,XX% base_amount tva_amount"
|
||||
- Supports historical rates (19%) and current rates (21%)
|
||||
- Date format: YYYY. MM. DD (with spaces)
|
||||
"""
|
||||
|
||||
CUI_LIST = ["11201891"]
|
||||
NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"] # OCR variants
|
||||
STORE_NAME = "OMV PETROM MARKETING S.R.L."
|
||||
|
||||
# OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva)
|
||||
TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'
|
||||
|
||||
# Standard TVA pattern (amount only); defined as a fallback but not applied by extract_tva_entries below
|
||||
TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)'
|
||||
|
||||
# OMV specific: prioritize YYYY. MM. DD format with spaces
|
||||
DATE_PATTERNS_OCR_SPACES = [
|
||||
# YYYY. MM. DD with time (OMV format)
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
|
||||
# Fallback to DD. MM. YYYY
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract OMV-specific TVA entries.
|
||||
|
||||
OMV receipts often show TVA in table format with base and TVA amounts.
|
||||
Returns an empty list if the table format is not found (no fallback is applied here).
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try table format first (more accurate)
|
||||
for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
# TVA amount is the second number after the percent (group 4); group 3 is the base amount
|
||||
tva_amount = self._parse_decimal(match.group(4))
|
||||
|
||||
if tva_amount and tva_amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': tva_amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return OMV-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": True,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
"tva_table_format": True,
|
||||
}
|
||||
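A sketch of what `TVA_TABLE_PATTERN` captures on the sample row quoted in the comment above it, and of one row shape it rejects. The second row (percent printed without decimals) is an invented example; whether OMV receipts ever print it that way is not known from this diff.

```python
import re

# Pattern copied from OMVProfile above.
TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'

# Sample row from the comment in omv.py: code-percent, base amount, TVA amount.
row = "A-19,00% 285,66 49,58"
match = re.search(TVA_TABLE_PATTERN, row, re.IGNORECASE)

code = match.group(1)
percent = int(match.group(2))   # only the integer part of "19,00" is captured
base_raw = match.group(3)       # base amount, left as raw text
tva_raw = match.group(4)        # TVA amount, the value the profile keeps

# Invented variant: the two decimals after the percent are mandatory in the
# pattern, so a row printed as "19%" does not match at all.
no_decimals = re.search(TVA_TABLE_PATTERN, "A-19% 285,66 49,58", re.IGNORECASE)

print(code, percent, base_raw, tva_raw, no_decimals)
```

Since `extract_tva_entries` in this profile relies on the table pattern alone, a row in the second shape would produce an empty result.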
101
backend/modules/data_entry/services/ocr/profiles/pictus_velum.py
Normal file
@@ -0,0 +1,101 @@
|
||||
"""
|
||||
PICTUS VELUM SRL store profile for OCR extraction.
|
||||
|
||||
Office supplies and stationery store.
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class PictusVelumProfile(BaseStoreProfile):
|
||||
"""
|
||||
PICTUS VELUM SRL - standard TVA profile.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (single rate, any percentage)
|
||||
- Office supplies and stationery (rechizite)
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["39634534"]
|
||||
NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"]
|
||||
STORE_NAME = "PICTUS VELUM SRL"
|
||||
|
||||
# Standard TVA patterns (flexible - accepts any rate)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - XX,XX% = YY,YY"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA XX% YY,YY" (simple format without code)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[2]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return PICTUS VELUM-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": False,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
111
backend/modules/data_entry/services/ocr/profiles/socar.py
Normal file
@@ -0,0 +1,111 @@
|
||||
"""
|
||||
SOCAR Petroleum store profile for OCR extraction.
|
||||
|
||||
SOCAR receipts are similar to OMV - gas station with client CUI support.
|
||||
Date format may use YYYY. MM. DD with spaces.
|
||||
"""
|
||||
|
||||
import re
|
||||
from datetime import date
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any, Tuple, Optional
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class SocarProfile(BaseStoreProfile):
|
||||
"""
|
||||
SOCAR PETROLEUM S.A. - standard TVA with client CUI.
|
||||
|
||||
Key characteristics:
|
||||
- Standard TVA format (usually single rate)
|
||||
- Includes client CUI on receipt (for business purchases)
|
||||
- Similar format to OMV/Petrom
|
||||
- Date format may use YYYY. MM. DD (with spaces)
|
||||
"""
|
||||
|
||||
CUI_LIST = ["12546600"]
|
||||
NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"] # OCR variants
|
||||
STORE_NAME = "SOCAR PETROLEUM S.A."
|
||||
|
||||
# Standard TVA patterns for gas stations
|
||||
TVA_PATTERNS = [
|
||||
# Table format: "A-19,00% 285,66 49,58"
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)',
|
||||
# Simple format: "TVA 19% 49,58"
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
# Gas stations may use YYYY. MM. DD format
|
||||
DATE_PATTERNS_OCR_SPACES = [
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
|
||||
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
|
||||
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract SOCAR-specific TVA entries.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try table format first
|
||||
table_pattern = self.TVA_PATTERNS[0]
|
||||
for match in re.finditer(table_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
tva_amount = self._parse_decimal(match.group(4))
|
||||
|
||||
if tva_amount and tva_amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': tva_amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
# Fallback to simple format if no table entries found
|
||||
if not entries:
|
||||
simple_pattern = self.TVA_PATTERNS[1]
|
||||
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
# Default to code 'A' for simple format
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break # Only take first match for simple format
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return SOCAR-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": False,
|
||||
"has_client_cui": True,
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
}
|
||||
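A sketch of how a `(regex, confidence, order)` list such as `DATE_PATTERNS_OCR_SPACES` could be consumed. The function `parse_receipt_date` is hypothetical: the code that actually reads this attribute is in the base class and is not shown in this part of the diff, so the first-match-wins strategy here is an assumption. The sample strings are invented, apart from the `YYYY. MM. DD` shape described in the docstrings.

```python
import re
from datetime import date

# Tuples copied from SocarProfile / OMVProfile above: (regex, confidence, order).
DATE_PATTERNS_OCR_SPACES = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def parse_receipt_date(text):
    """Hypothetical consumer: first pattern that yields a valid date wins."""
    for pattern, confidence, order in DATE_PATTERNS_OCR_SPACES:
        match = re.search(pattern, text)
        if not match:
            continue
        a, b, c = (int(group) for group in match.groups())
        year, month, day = (a, b, c) if order == 'ymd' else (c, b, a)
        try:
            return date(year, month, day), confidence
        except ValueError:
            continue  # e.g. an OCR misread producing month 13
    return None


parsed_ymd = parse_receipt_date("2025. 08. 14 12:30")
parsed_dmy = parse_receipt_date("14.08.2025")
print(parsed_ymd, parsed_dmy)
```

Listing the patterns with a time component first means a date followed by `HH:MM` is reported with the higher confidence value, which matches the ordering in the profile.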
@@ -0,0 +1,112 @@
|
||||
"""
|
||||
STEPOUT MARKET SRL store profile for OCR extraction.
|
||||
|
||||
Bookstore with reduced TVA rate (5% for books in Romania).
|
||||
"""
|
||||
|
||||
import re
|
||||
from decimal import Decimal, InvalidOperation
|
||||
from typing import List, Dict, Any
|
||||
|
||||
from .base import BaseStoreProfile
|
||||
from . import ProfileRegistry
|
||||
|
||||
|
||||
@ProfileRegistry.register
|
||||
class StepoutMarketProfile(BaseStoreProfile):
|
||||
"""
|
||||
STEPOUT MARKET SRL - reduced TVA rate profile (books).
|
||||
|
||||
Key characteristics:
|
||||
- Reduced TVA rate: 5% for books (cărți) in Romania
|
||||
- May also have standard rates for non-book items
|
||||
- Patterns are flexible to accept ANY TVA rate
|
||||
- CARD payment typical
|
||||
"""
|
||||
|
||||
CUI_LIST = ["35532655"]
|
||||
NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"]
|
||||
STORE_NAME = "STEPOUT MARKET SRL"
|
||||
|
||||
# TVA patterns (flexible - accepts any rate including 5%)
|
||||
TVA_PATTERNS = [
|
||||
# "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format)
|
||||
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "A - 5,00% = YY,YY" (table format)
|
||||
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||
# "TVA 5% YY,YY" (simple format - common for single rate)
|
||||
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||
# "TVA 5,00%: YY,YY" (percent with colon)
|
||||
r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
|
||||
]
|
||||
|
||||
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||
"""
|
||||
Extract TVA entries from receipt text.
|
||||
|
||||
Stepout Market primarily sells books which have 5% TVA in Romania.
|
||||
The patterns are generic and will extract whatever rate is on the receipt.
|
||||
|
||||
Args:
|
||||
text: Raw OCR text from receipt
|
||||
|
||||
Returns:
|
||||
List of TVA entries with code, percent, and amount
|
||||
"""
|
||||
entries = []
|
||||
seen = set()
|
||||
|
||||
# Try coded patterns first (have code letter)
|
||||
for pattern in self.TVA_PATTERNS[:2]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
code = match.group(1).upper()
|
||||
percent = int(match.group(2))
|
||||
amount = self._parse_decimal(match.group(3))
|
||||
|
||||
if amount and amount > 0:
|
||||
entry_key = (code, percent)
|
||||
if entry_key not in seen:
|
||||
entries.append({
|
||||
'code': code,
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
seen.add(entry_key)
|
||||
except (ValueError, InvalidOperation, IndexError):
|
||||
continue
|
||||
|
||||
# Fallback to simple format (no code letter, just percent + amount)
|
||||
if not entries:
|
||||
for pattern in self.TVA_PATTERNS[2:]:
|
||||
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||
try:
|
||||
percent = int(match.group(1))
|
||||
amount = self._parse_decimal(match.group(2))
|
||||
|
||||
if amount and amount > 0:
|
||||
# Default to code 'A' for simple format
|
||||
entries.append({
|
||||
'code': 'A',
|
||||
'percent': percent,
|
||||
'amount': amount
|
||||
})
|
||||
break # Only take first match for simple format
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
if entries:
|
||||
break
|
||||
|
||||
return entries
|
||||
|
||||
def get_validation_hints(self) -> Dict[str, Any]:
|
||||
"""Return STEPOUT MARKET-specific validation hints."""
|
||||
return {
|
||||
"has_multi_rate_tva": False,
|
||||
"card_equals_total": True,
|
||||
"has_client_cui": True, # May have client CUI
|
||||
"has_efactura": False,
|
||||
"is_non_vat_payer": False,
|
||||
"typical_tva_rate": 5, # Books have 5% TVA in Romania
|
||||
"product_category": "books",
|
||||
}
|
||||
@@ -0,0 +1,103 @@
"""
UNLIMITED KEYS S.R.L. store profile for OCR extraction.

Key duplication service. Notable for CASH (NUMERAR) payments.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class UnlimitedKeysProfile(BaseStoreProfile):
    """
    UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Key duplication service
    - NUMERAR (cash) payment common - different from most stores!
    - May also accept CARD
    """

    CUI_LIST = ["18993187"]
    NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"]
    STORE_NAME = "UNLIMITED KEYS S.R.L."

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return UNLIMITED KEYS-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,  # May be NUMERAR (cash)
            "has_client_cui": True,  # May have client CUI
            "has_efactura": False,
            "is_non_vat_payer": False,
            "common_payment": "NUMERAR",  # Cash payments common
        }
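The extraction order used by this profile (coded patterns first, then a single simple-format fallback) can be tried outside the backend with a standalone sketch. The regular expressions are copied from `TVA_PATTERNS`; `parse_decimal` is a hypothetical stand-in for `BaseStoreProfile._parse_decimal`, whose real implementation is not part of this diff, and the sample receipt strings are invented for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List, Optional

# Patterns copied from UnlimitedKeysProfile.TVA_PATTERNS.
TVA_PATTERNS = [
    r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
    r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
    r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
]


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Hypothetical helper: accepts '3,99' or '3.99', returns None on garbage."""
    cleaned = raw.strip('.,')
    if ',' in cleaned:
        # Assume a Romanian-style number: '.' groups thousands, ',' is the decimal mark.
        cleaned = cleaned.replace('.', '').replace(',', '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_tva_entries(text: str) -> List[dict]:
    entries, seen = [], set()

    # Coded patterns: keep one entry per (code, percent) pair.
    for pattern in TVA_PATTERNS[:2]:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            code = match.group(1).upper()
            percent = int(match.group(2))
            amount = parse_decimal(match.group(3))
            if amount and amount > 0 and (code, percent) not in seen:
                entries.append({'code': code, 'percent': percent, 'amount': amount})
                seen.add((code, percent))

    # Simple format is consulted only when no coded entry was found.
    if not entries:
        for match in re.finditer(TVA_PATTERNS[2], text, re.IGNORECASE):
            amount = parse_decimal(match.group(2))
            if amount and amount > 0:
                entries.append({'code': 'A', 'percent': int(match.group(1)), 'amount': amount})
                break  # first valid match only

    return entries


print(extract_tva_entries("TVA A: 19% = 3,99"))
print(extract_tva_entries("BON FISCAL\nTVA 19% 4,75\nTOTAL 29,75"))
```

Because the second coded pattern also matches the `A: 19% = 3,99` tail of a line already matched by the first, the `(code, percent)` set is what keeps the entry from being recorded twice.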
@@ -7,6 +7,7 @@ from typing import Optional, Tuple, List
|
||||
from dataclasses import dataclass, field
|
||||
|
||||
from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
|
||||
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -63,6 +64,57 @@ class ExtractionResult:
|
||||
class ReceiptExtractor:
|
||||
"""Extract receipt fields using pattern matching for Romanian receipts."""
|
||||
|
||||
# =========================================================================
|
||||
# DEPRECATED: STORE_PROFILES dict - USE ProfileRegistry INSTEAD
|
||||
# =========================================================================
|
||||
# Store profiles are now managed by ProfileRegistry in:
|
||||
# backend/modules/data_entry/services/ocr/profiles/
|
||||
#
|
||||
# This dict is kept for reference only. All extraction logic now uses:
|
||||
# ProfileRegistry.get_profile(cui)
|
||||
#
|
||||
# See: backend/modules/data_entry/services/ocr/profiles/README.md
|
||||
# =========================================================================
|
||||
STORE_PROFILES = {
|
||||
# Lidl - multi-rate TVA (A+B), specific format without hyphen/colon
|
||||
"22891860": {
|
||||
"name": "LIDL DISCOUNT S.R.L.",
|
||||
"tva_pattern": "lidl",
|
||||
"tva_format": "TVA {code} {percent}% {amount}",
|
||||
"has_multi_rate_tva": True,
|
||||
"card_equals_total": True,
|
||||
},
|
||||
# OMV Petrom - single TVA rate, client CUI included
|
||||
"11201891": {
|
||||
"name": "OMV PETROM MARKETING S.R.L.",
|
||||
"tva_pattern": "standard",
|
||||
"has_client_cui": True,
|
||||
},
|
||||
# FIVE-HOLDING (BRICK) - standard format
|
||||
"10562600": {
|
||||
"name": "FIVE-HOLDING S.A.",
|
||||
"tva_pattern": "standard",
|
||||
},
|
||||
# Dedeman - e-factura format
|
||||
"2816464": {
|
||||
"name": "DEDEMAN SRL",
|
||||
"tva_pattern": "standard",
|
||||
"has_efactura": True,
|
||||
},
|
||||
# SOCAR Petroleum
|
||||
"12546600": {
|
||||
"name": "SOCAR PETROLEUM S.A.",
|
||||
"tva_pattern": "standard",
|
||||
"has_client_cui": True,
|
||||
},
|
||||
# Kineterra - non-VAT payer
|
||||
"31180432": {
|
||||
"name": "KINETERRA CONCEPT SRL",
|
||||
"tva_pattern": "none",
|
||||
"is_non_vat_payer": True,
|
||||
},
|
||||
}
|
||||
|
||||
# Total amount patterns (most specific first)
|
||||
# Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc.
|
||||
# OCR often produces errors, so patterns must be tolerant
|
||||
@@ -394,48 +446,101 @@ class ReceiptExtractor:
|
||||
result.raw_text = text
|
||||
text_upper = text.upper()
|
||||
|
||||
# Extract core fields
|
||||
result.amount, result.confidence_amount = self._extract_amount(text_upper)
|
||||
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
|
||||
result.receipt_number, _ = self._extract_number(text_upper)
|
||||
result.receipt_series, _ = self._extract_series(text_upper)
|
||||
# =========================================================================
|
||||
# STEP 1: Extract vendor info FIRST to find store profile
|
||||
# =========================================================================
|
||||
result.partner_name, result.confidence_vendor = self._extract_vendor(text)
|
||||
result.cui, _ = self._extract_cui(text_upper, text)
|
||||
# Normalize CUI: fix R0 → RO OCR error and validate format
|
||||
result.cui = OCRValidationEngine.normalize_cui(result.cui)
|
||||
|
||||
# Extract additional fields - Multiple TVA entries
|
||||
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
|
||||
# Lookup store-specific profile for enhanced extraction accuracy
|
||||
store_profile = ProfileRegistry.get_profile(result.cui) if result.cui else None
|
||||
if store_profile:
|
||||
print(f"[Profile] Using {store_profile.__class__.__name__} for CUI {result.cui}", flush=True)
|
||||
|
||||
# =========================================================================
|
||||
# STEP 2: Extract ALL fields using profile (if available) or generic
|
||||
# =========================================================================
|
||||
if store_profile:
|
||||
# Profile-specific extraction (higher accuracy for known stores)
|
||||
result.amount, result.confidence_amount = store_profile.extract_total(text_upper)
|
||||
result.receipt_date, result.confidence_date = store_profile.extract_date(text_upper)
|
||||
result.receipt_number, _ = store_profile.extract_receipt_number(text_upper)
|
||||
result.tva_entries = store_profile.extract_tva_entries(text_upper)
|
||||
result.tva_total = sum(e['amount'] for e in result.tva_entries) if result.tva_entries else None
|
||||
result.payment_methods = store_profile.extract_payment_methods(text_upper)
|
||||
|
||||
# Client data extraction via profile (CUI + name)
|
||||
profile_client_cui, cui_confidence = store_profile.extract_client_cui(text_upper)
|
||||
profile_client_name, name_confidence = store_profile.extract_client_name(text)
|
||||
|
||||
if profile_client_cui or profile_client_name:
|
||||
# Use profile extraction results
|
||||
result.client_cui = OCRValidationEngine.normalize_cui(profile_client_cui) if profile_client_cui else None
|
||||
result.client_name = profile_client_name
|
||||
result.confidence_client = max(cui_confidence, name_confidence)
|
||||
# Address still via generic (no profile method)
|
||||
_, _, client_address, _ = self._extract_client_data(text_upper, text)
|
||||
result.client_address = client_address
|
||||
else:
|
||||
# Fallback to generic client extraction
|
||||
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
|
||||
result.client_name = client_name
|
||||
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
|
||||
result.client_address = client_address
|
||||
result.confidence_client = confidence
|
||||
|
||||
print(f"[Profile] Extracted: total={result.amount}, date={result.receipt_date}, "
|
||||
f"TVA entries={len(result.tva_entries)}, payments={len(result.payment_methods)}", flush=True)
|
||||
else:
|
||||
# Generic extraction for unknown stores
|
||||
result.amount, result.confidence_amount = self._extract_amount(text_upper)
|
||||
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
|
||||
result.receipt_number, _ = self._extract_number(text_upper)
|
||||
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
|
||||
result.payment_methods = self._extract_payment_methods(text_upper)
|
||||
|
||||
# Generic client extraction
|
||||
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
|
||||
result.client_name = client_name
|
||||
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
|
||||
result.client_address = client_address
|
||||
result.confidence_client = confidence
|
||||
|
||||
# Series extraction (no profile method, always generic)
|
||||
result.receipt_series, _ = self._extract_series(text_upper)
|
||||
|
||||
# =========================================================================
|
||||
# STEP 3: Debug logging and validation
|
||||
# =========================================================================
|
||||
if not result.tva_entries:
|
||||
print("[TVA Debug] No TVA found. Checking patterns...", flush=True)
|
||||
# Debug: show what patterns see
|
||||
normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper)
|
||||
taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
||||
rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
||||
print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True)
|
||||
|
||||
# Log TVA vs TOTAL for debugging (validation happens in ocr_service._final_validation)
|
||||
# NOTE: We NO LONGER clear TVA here - the service will recalculate TOTAL from TVA if needed
|
||||
# Log TVA vs TOTAL for debugging
|
||||
if result.tva_total and result.amount:
|
||||
if result.tva_total > result.amount:
|
||||
print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True)
|
||||
elif result.tva_total > result.amount * Decimal('0.5'):
|
||||
print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True)
|
||||
|
||||
# Additional generic extractions
|
||||
result.items_count = self._extract_items_count(text_upper)
|
||||
result.address = self._extract_address(text_upper)
|
||||
result.payment_methods = self._extract_payment_methods(text_upper)
|
||||
|
||||
# Validate payment methods against extracted amount
|
||||
# If payment sum >> amount, clear invalid payments (likely OCR error)
|
||||
# =========================================================================
|
||||
# STEP 4: Validate and post-process
|
||||
# =========================================================================
|
||||
# Save original payment methods before validation (for payment mode detection)
|
||||
original_payment_methods = result.payment_methods.copy() if result.payment_methods else []
|
||||
|
||||
# Validate payment methods against extracted amount
|
||||
result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount)
|
||||
|
||||
# Auto-suggest payment_mode based on detected payment methods
|
||||
# Use ORIGINAL payment_methods to detect CARD even if validation cleared them
|
||||
# (e.g., CARD 318.16 is valid even if total validation failed)
|
||||
payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods
|
||||
if payment_methods_for_mode:
|
||||
card_amount = sum(
|
||||
@@ -447,17 +552,9 @@ class ReceiptExtractor:
|
||||
result.suggested_payment_mode = 'banca'
|
||||
print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True)
|
||||
else:
|
||||
# Only cash payments detected
|
||||
result.suggested_payment_mode = 'numerar'
|
||||
print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True)
|
||||
|
||||
# Extract client data (B2B receipts)
|
||||
client_name, client_cui, client_address, confidence_client = self._extract_client_data(text_upper, text)
|
||||
result.client_name = client_name
|
||||
result.client_cui = OCRValidationEngine.normalize_cui(client_cui) # Fix R0 → RO OCR error
|
||||
result.client_address = client_address
|
||||
result.confidence_client = confidence_client
|
||||
|
||||
# Detect receipt type
|
||||
result.receipt_type = self._detect_receipt_type(text_upper)
|
||||
|
||||
@@ -620,6 +717,40 @@ class ReceiptExtractor:
|
||||
|
||||
return num_str
|
||||
|
||||
    def _calculate_multi_rate_tva_total(self, tva_entries: List[dict]) -> Optional[Decimal]:
        """
        Calculate implied total from ALL TVA entries (multi-rate support).

        Formula for each entry: total_for_entry = tva * (100 + rate) / rate
        Final total = sum of all entry totals

        Example for Lidl (TVA A 21% = 7.71, TVA B 11% = 2.13):
            Entry A: 7.71 * 121 / 21 = 44.424
            Entry B: 2.13 * 111 / 11 = 21.494
            Total: 44.424 + 21.494 = 65.92 ≈ 65.86 (within tolerance)

        Returns:
            Implied total Decimal, or None if calculation not possible
        """
        if not tva_entries:
            return None

        total = Decimal('0')
        for entry in tva_entries:
            rate = entry.get('percent', 0)
            tva_amount = entry.get('amount')
            if tva_amount and rate > 0:
                try:
                    tva_dec = Decimal(str(tva_amount))
                    # Formula: total_for_entry = tva * (100 + rate) / rate
                    entry_total = tva_dec * Decimal(100 + rate) / Decimal(rate)
                    total += entry_total
                    print(f"[Multi-rate TVA] Entry {entry.get('code', '?')}: tva={tva_amount}, rate={rate}% -> implied={entry_total:.2f}", flush=True)
                except (InvalidOperation, ValueError, TypeError):
                    continue

        return total.quantize(Decimal('0.01')) if total > 0 else None
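As a quick check of the per-entry formula `tva * (100 + rate) / rate`, the following standalone sketch reproduces the calculation for the two Lidl entries (TVA A 21% = 7.71, TVA B 11% = 2.13). `implied_total` is an illustrative free function, not the method itself; logging and error handling are left out.

```python
from decimal import Decimal
from typing import List, Optional


def implied_total(tva_entries: List[dict]) -> Optional[Decimal]:
    """Sum tva * (100 + rate) / rate over all entries, then round to 2 decimals."""
    total = Decimal('0')
    for entry in tva_entries:
        rate = entry.get('percent', 0)
        tva = entry.get('amount')
        if tva and rate > 0:
            total += Decimal(str(tva)) * Decimal(100 + rate) / Decimal(rate)
    return total.quantize(Decimal('0.01')) if total > 0 else None


lidl = [
    {'code': 'A', 'percent': 21, 'amount': Decimal('7.71')},
    {'code': 'B', 'percent': 11, 'amount': Decimal('2.13')},
]
print(implied_total(lidl))  # 65.92
```

The result differs from the printed receipt total of 65.86 by about 0.09%, because each TVA amount on the receipt is already rounded to two decimals. That rounding is why the cross-validation compares sources with a percentage tolerance rather than for equality.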
|
||||
def _cross_validate_and_calculate_amount(
|
||||
self,
|
||||
amount: Optional[Decimal],
|
||||
@@ -634,12 +765,11 @@ class ReceiptExtractor:
|
||||
Returns: (amount, confidence, source_description)
|
||||
|
||||
Logic:
|
||||
1. If amount is valid (>0) with high confidence (>=0.8), use it directly
|
||||
2. Calculate payment_sum = CARD + NUMERAR + other methods
|
||||
3. Calculate tva_implied_total = tva_total * (100 + rate) / rate
|
||||
4. Cross-validate: if payment_sum matches extracted amount, boost confidence
|
||||
5. If amount is 0/None, use payment_sum as total
|
||||
6. If payment_sum is 0, try to calculate from TVA
|
||||
1. Collect all available sources: extracted amount, payment sum, TVA-implied total
|
||||
2. Find consensus: 2+ sources within 3% tolerance
|
||||
3. If consensus found, use the higher-confidence source value
|
||||
4. If extracted differs >10% from all others, it's an outlier - correct it
|
||||
5. If no consensus possible, fallback to individual validations
|
||||
"""
|
||||
# Calculate payment methods sum
|
||||
payment_sum = Decimal('0')
|
||||
@@ -652,43 +782,73 @@ class ReceiptExtractor:
|
||||
except (InvalidOperation, ValueError, TypeError):
|
||||
continue
|
||||
|
||||
# Calculate TVA-implied total: total = tva * (100 + rate) / rate
|
||||
tva_implied_total = None
|
||||
if tva_entries:
|
||||
# Use the main TVA entry (typically the largest or first one)
|
||||
main_entry = tva_entries[0]
|
||||
rate = main_entry.get('percent', 19)
|
||||
tva_amount = main_entry.get('amount')
|
||||
if tva_amount and rate > 0:
|
||||
try:
|
||||
tva_dec = Decimal(str(tva_amount))
|
||||
# total = tva * (100 + rate) / rate
|
||||
tva_implied_total = (tva_dec * Decimal(100 + rate) / Decimal(rate)).quantize(Decimal('0.01'))
|
||||
except (InvalidOperation, ValueError, TypeError):
|
||||
pass
|
||||
# Calculate TVA-implied total using ALL entries (multi-rate fix)
|
||||
tva_implied_total = self._calculate_multi_rate_tva_total(tva_entries)
|
||||
|
||||
# Case 1: Amount is valid with high confidence - validate against TVA and payments
|
||||
# Multi-source consensus approach (3% tolerance for multi-rate TVA rounding)
|
||||
CONSENSUS_TOLERANCE = 3.0 # 3% tolerance
|
||||
|
||||
# Collect all available sources with their confidences
|
||||
sources = []
|
||||
if amount and amount > 0:
|
||||
sources.append(('extracted', float(amount), confidence_amount))
|
||||
if payment_sum > 0:
|
||||
sources.append(('payment', float(payment_sum), 0.92)) # Payment is very reliable
|
||||
if tva_implied_total and tva_implied_total > 0:
|
||||
sources.append(('tva_calc', float(tva_implied_total), 0.88)) # TVA calc is reliable
|
||||
|
||||
print(f"[Cross-Validation] Sources: {[(s[0], f'{s[1]:.2f}', f'{s[2]:.2f}') for s in sources]}", flush=True)
|
||||
|
||||
# Find consensus: 2+ sources within tolerance
|
||||
if len(sources) >= 2:
|
||||
for i, (name1, val1, conf1) in enumerate(sources):
|
||||
for name2, val2, conf2 in sources[i+1:]:
|
||||
if val1 <= 0 or val2 <= 0:
|
||||
continue
|
||||
diff_pct = abs(val1 - val2) / max(val1, val2) * 100
|
||||
if diff_pct <= CONSENSUS_TOLERANCE:
|
||||
# Consensus found! Use value from higher-confidence source
|
||||
if conf1 >= conf2:
|
||||
consensus_val, consensus_conf = val1, conf1
|
||||
else:
|
||||
consensus_val, consensus_conf = val2, conf2
|
||||
# Boost confidence for consensus
|
||||
consensus_conf = min(0.98, consensus_conf + 0.05)
|
||||
print(f"[Cross-Validation] Consensus: {name1}={val1:.2f} ≈ {name2}={val2:.2f} (diff={diff_pct:.1f}%)", flush=True)
|
||||
return Decimal(str(round(consensus_val, 2))), consensus_conf, f"consensus ({name1}+{name2})"
|
||||
|
||||
# No consensus - check if extracted is an outlier (differs >10% from all others)
|
||||
if amount and amount > 0 and len(sources) >= 2:
|
||||
other_sources = [s for s in sources if s[0] != 'extracted']
|
||||
if other_sources:
|
||||
extracted_val = float(amount)
|
||||
all_differ = all(
|
||||
abs(extracted_val - s[1]) / max(extracted_val, s[1]) * 100 > 10
|
||||
for s in other_sources if s[1] > 0
|
||||
)
|
||||
if all_differ:
|
||||
# Extracted differs significantly from all others - use the best other source
|
||||
best_other = max(other_sources, key=lambda s: s[2])
|
||||
print(f"[Cross-Validation] Extracted outlier: {extracted_val:.2f} differs >10% from all others, using {best_other[0]}={best_other[1]:.2f}", flush=True)
|
||||
return Decimal(str(round(best_other[1], 2))), best_other[2], f"corrected (extracted outlier, using {best_other[0]})"
|
||||
|
||||
# Fallback: Case 1 - Amount valid with high confidence
|
||||
if amount and amount > 0 and confidence_amount >= 0.8:
|
||||
# First check TVA-implied total (most reliable when TVA is extracted correctly)
|
||||
# Check TVA-implied total
|
||||
if tva_implied_total and tva_implied_total > 0:
|
||||
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
||||
if tva_diff_percent <= 1:
|
||||
# Near-perfect TVA match - highest confidence
|
||||
if tva_diff_percent <= 3:
|
||||
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)"
|
||||
elif tva_diff_percent > 10:
|
||||
# Significant mismatch - TVA-implied total is more reliable
|
||||
# This catches cases where wrong TOTAL line was extracted (e.g., REST, SUBTOTAL)
|
||||
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
||||
return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)"
|
||||
|
||||
# Cross-validate with payment methods
|
||||
if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'):
|
||||
# Perfect match - boost confidence
|
||||
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)"
|
||||
elif payment_sum > 0:
|
||||
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
||||
if payment_diff_percent > 10:
|
||||
# Significant mismatch - payment sum is more reliable
|
||||
print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True)
|
||||
return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)"
|
||||
|
||||
@@ -696,29 +856,22 @@ class ReceiptExtractor:
|
||||
|
||||
# Case 2: Amount exists but low confidence - try to validate/correct
|
||||
if amount and amount > 0:
|
||||
# First check TVA-implied total (most reliable)
|
||||
if tva_implied_total and tva_implied_total > 0:
|
||||
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
||||
if tva_diff_percent <= 2:
|
||||
# Close match - boost confidence
|
||||
if tva_diff_percent <= 3:
|
||||
return amount, 0.88, "extracted (validated by TVA)"
|
||||
elif tva_diff_percent > 10:
|
||||
# Significant mismatch - use TVA-implied total
|
||||
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
||||
return tva_implied_total, 0.85, "calculated from TVA"
|
||||
|
||||
# Check if payment methods sum matches
|
||||
if payment_sum > 0:
|
||||
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
||||
if payment_diff_percent <= 0.5:
|
||||
# Close match - boost confidence
|
||||
if payment_diff_percent <= 1:
|
||||
return amount, 0.90, "extracted (validated by payment methods)"
|
||||
elif payment_diff_percent > 10:
|
||||
# Mismatch - prefer payment_sum as it's more reliable
|
||||
print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True)
|
||||
return payment_sum, 0.85, "calculated from payment methods"
|
||||
|
||||
# No validation possible - return as-is
|
||||
return amount, confidence_amount, "extracted (unvalidated)"
|
||||
|
||||
# Case 3: Amount is 0 or None - calculate from payment methods
|
||||
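The consensus step described in the docstring (two or more sources within 3% of each other, keeping the value of the higher-confidence source and adding a small confidence boost) can be isolated as a small sketch. `find_consensus` and the sample figures are hypothetical; the tolerance, the 0.05 boost and the 0.98 cap follow the values used in the hunk above.

```python
from typing import List, Optional, Tuple

Source = Tuple[str, float, float]  # (name, value, confidence)


def find_consensus(
    sources: List[Source], tolerance_pct: float = 3.0
) -> Optional[Tuple[float, float, str]]:
    """Return (value, confidence, description) for the first agreeing pair, else None."""
    for i, (name1, val1, conf1) in enumerate(sources):
        for name2, val2, conf2 in sources[i + 1:]:
            if val1 <= 0 or val2 <= 0:
                continue
            diff_pct = abs(val1 - val2) / max(val1, val2) * 100
            if diff_pct <= tolerance_pct:
                # Keep the value reported by the more trusted source.
                value, conf = (val1, conf1) if conf1 >= conf2 else (val2, conf2)
                return round(value, 2), min(0.98, conf + 0.05), f"consensus ({name1}+{name2})"
    return None


# A misread total (6.59) is outvoted by the payment sum and the TVA-implied total.
sources = [
    ('extracted', 6.59, 0.85),
    ('payment', 65.86, 0.92),
    ('tva_calc', 65.92, 0.88),
]
print(find_consensus(sources))
```

Pairs are checked in list order, so the first agreeing pair wins even if a later pair agrees more closely. This mirrors the nested loop in the diff rather than searching for the tightest match.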
@@ -946,6 +1099,28 @@ class ReceiptExtractor:
|
||||
|
||||
return name
|
||||
|
||||
    def _get_store_profile(self, cui: Optional[str]) -> Optional[dict]:
        """
        Get store-specific profile by CUI.

        DEPRECATED: Use ProfileRegistry.get_profile() directly for profile objects.
        This method is kept for backward compatibility and returns validation hints dict.

        Args:
            cui: The CUI extracted from receipt (with or without RO prefix)

        Returns:
            Store profile validation hints dict or None if not found
        """
        profile = ProfileRegistry.get_profile(cui)
        if profile:
            # Return validation hints for backward compatibility
            hints = profile.get_validation_hints()
            hints['name'] = profile.STORE_NAME
            print(f"[Store Profile] Found profile for {cui}: {profile.STORE_NAME}", flush=True)
            return hints
        return None
|
||||
def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]:
|
||||
"""
|
||||
Extract vendor CUI (fiscal identification code) from text.
|
||||
@@ -1020,11 +1195,114 @@ class ReceiptExtractor:
|
||||
# Default to bon_fiscal if neither found
|
||||
return 'bon_fiscal'
|
||||
|
||||
    def _try_pattern_lidl(self, text: str) -> List[dict]:
        """
        Try Lidl-style TVA pattern: "TVA A 21,00% 7.71" (no hyphen/colon separator).

        Lidl receipts format:
            TOTAL TVA 9,84
            TVA A 21,00% 7,71
            TVA B 11,00% 2,13

        Returns list of TVA entries found.
        """
        entries = []
        seen = set()

        # Pattern: TVA/TUA/IVA + code (A-D) + percent + amount (on same line)
        # Handles: "TVA A 21,00% 7,71", "TVA B 11,00% 2,13", "TUA A 21% 7.71"
        lidl_patterns = [
            # Same line: "TVA A 21,00% 7.71" (with various spacing)
            r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
            # Same line with backslash (OCR artifact): "TVA A \21,00% 7.71"
            r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
            # IVA variant
            r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
        ]

        for pattern in lidl_patterns:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount_str = self._normalize_number(match.group(3))
                    amount = Decimal(amount_str)

                    if amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                            print(f"[TVA Lidl] Found: TVA {code} {percent}% = {amount}", flush=True)
                except (ValueError, InvalidOperation):
                    continue

        return entries
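To see what the first Lidl pattern captures, here is a minimal sketch run against the three-line sample from the docstring. The regular expression is copied from `lidl_patterns`; the `replace(',', '.')` conversion is a simplifying assumption that stands in for `_normalize_number`, which is not shown in this diff.

```python
import re
from decimal import Decimal

# First entry of lidl_patterns, copied verbatim.
LIDL_TVA = r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'

receipt = """TOTAL TVA 9,84
TVA A 21,00% 7,71
TVA B 11,00% 2,13"""

entries = []
for match in re.finditer(LIDL_TVA, receipt, re.IGNORECASE):
    code = match.group(1).upper()
    percent = int(match.group(2))  # "21,00" is captured as 21; the decimals are skipped
    amount = Decimal(match.group(3).replace(',', '.'))  # simplified number parsing
    entries.append((code, percent, amount))

print(entries)
print(sum(amount for _, _, amount in entries))  # 9.84
```

The `TOTAL TVA 9,84` line is not picked up as an entry because the pattern requires a code letter between `TVA` and the percentage. The per-rate amounts add up to that total, which is the property the candidate selection relies on.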
|
||||
    def _select_best_tva_candidate(
        self,
        candidates: List[tuple],
        tva_bon_total: Optional[Decimal]
    ) -> Tuple[List[dict], Optional[Decimal]]:
        """
        Select the best TVA candidate from collected candidates.

        Selection criteria (priority order):
        1. Sum matches TOTAL TVA BON (highest priority)
        2. More entries = better (for multi-rate receipts)
        3. Pattern confidence as tiebreaker

        Args:
            candidates: List of (pattern_name, confidence, entries, sum)
            tva_bon_total: Authoritative TOTAL TVA BON value (if extracted)

        Returns:
            (best_entries, best_sum)
        """
        if not candidates:
            return [], None

        # Score each candidate
        scored = []
        for name, confidence, entries, sum_val in candidates:
            score = 0.0

            # Criterion 1: Sum matches TOTAL TVA BON (highest priority)
            if tva_bon_total and sum_val:
                tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02'))  # 2% tolerance
                if abs(sum_val - tva_bon_total) <= tolerance:
                    score += 100  # High bonus for matching authoritative total
                    print(f"[TVA Select] {name}: sum {sum_val} matches tva_bon_total {tva_bon_total}", flush=True)

            # Criterion 2: More entries (for multi-rate receipts)
            score += len(entries) * 10

            # Criterion 3: Pattern confidence
            score += confidence * 5

            scored.append((score, name, confidence, entries, sum_val))
            print(f"[TVA Select] Candidate {name}: score={score:.1f}, entries={len(entries)}, sum={sum_val}", flush=True)

        # Sort by score descending
        scored.sort(key=lambda x: x[0], reverse=True)
        best = scored[0]
        print(f"[TVA Select] Winner: {best[1]} (score={best[0]:.1f})", flush=True)

        return best[3], best[4]
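The three scoring criteria can be worked through on a small example. This sketch assumes two invented candidates for the same Lidl receipt: one that found both rates and one that found only the first. `score_candidate` is a hypothetical free-function version of the loop body above, using the same weights (100, 10 per entry, 5 times the confidence).

```python
from decimal import Decimal
from typing import List, Optional, Tuple

Candidate = Tuple[str, float, List[dict], Decimal]  # (pattern_name, confidence, entries, sum)


def score_candidate(candidate: Candidate, tva_bon_total: Optional[Decimal]) -> float:
    _name, confidence, entries, sum_val = candidate
    score = 0.0
    # Criterion 1: the entry sum agrees with TOTAL TVA BON (2% or 0.02 tolerance).
    if tva_bon_total and sum_val:
        tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02'))
        if abs(sum_val - tva_bon_total) <= tolerance:
            score += 100
    # Criterion 2: more entries, and criterion 3: pattern confidence.
    score += len(entries) * 10
    score += confidence * 5
    return score


entry_a = {'code': 'A', 'percent': 21, 'amount': Decimal('7.71')}
entry_b = {'code': 'B', 'percent': 11, 'amount': Decimal('2.13')}
candidates = [
    ('lidl', 0.96, [entry_a, entry_b], Decimal('9.84')),
    ('standard', 0.90, [entry_a], Decimal('7.71')),
]
tva_bon_total = Decimal('9.84')

for cand in candidates:
    print(cand[0], round(score_candidate(cand, tva_bon_total), 1))

best = max(candidates, key=lambda c: score_candidate(c, tva_bon_total))
print(best[0])  # lidl
```

With these weights the match against the authoritative total dominates. A candidate that misses the total would need more than ten extra entries to overtake one that matches it.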
|
||||
def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]:
|
||||
"""
|
||||
Extract multiple TVA (VAT) entries from text.
|
||||
Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%).
|
||||
|
||||
Uses CANDIDATE COLLECTION approach:
|
||||
- Try ALL patterns and collect candidates
|
||||
- Select best candidate based on matching TOTAL TVA BON
|
||||
|
||||
Returns (tva_entries, tva_total) where tva_entries is a list of:
|
||||
{'code': 'A', 'percent': 19, 'amount': Decimal('15.20')}
|
||||
"""
|
||||
@@ -1054,6 +1332,22 @@ class ReceiptExtractor:
|
||||
# Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%")
|
||||
normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text)
|
||||
|
||||
# Extract TOTAL TVA BON/TOTAL TVA first as the authoritative reference
|
||||
tva_bon_total = self._extract_total_tva_bon(normalized_text)
|
||||
print(f"[TVA Debug] TOTAL TVA BON: {tva_bon_total}", flush=True)
|
||||
|
||||
# CANDIDATE COLLECTION APPROACH: Try all patterns, collect candidates, select best
|
||||
all_candidates = [] # List of (pattern_name, confidence, entries, sum)
|
||||
|
||||
# === LIDL-STYLE PATTERNS (NEW) ===
|
||||
# Lidl format: "TVA A 21,00% 7.71" or "TVA B 11,00% 2.13" (no hyphen/colon)
|
||||
# This pattern handles multi-rate TVA receipts
|
||||
lidl_entries = self._try_pattern_lidl(normalized_text)
|
||||
if lidl_entries:
|
||||
lidl_sum = sum(e['amount'] for e in lidl_entries)
|
||||
all_candidates.append(('lidl', 0.96, lidl_entries, lidl_sum))
|
||||
print(f"[TVA Debug] Lidl pattern: {len(lidl_entries)} entries, sum={lidl_sum}", flush=True)
|
||||
|
||||
# Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable
|
||||
# Format: "TOTAL TAXE: 55,22" - this is always the TVA amount
|
||||
# OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:"
|
||||
@@ -1372,10 +1666,21 @@ class ReceiptExtractor:
|
||||
except (ValueError, InvalidOperation):
|
||||
continue
|
||||
|
||||
# Extract TOTAL TVA BON as reference (separate from individual entries)
|
||||
tva_bon_total = self._extract_total_tva_bon(normalized_text)
|
||||
# Add existing extraction results to candidates (if any)
|
||||
if tva_entries:
|
||||
entries_sum = sum(entry['amount'] for entry in tva_entries)
|
||||
all_candidates.append(('standard', 0.90, tva_entries, entries_sum))
|
||||
print(f"[TVA Debug] Standard patterns: {len(tva_entries)} entries, sum={entries_sum}", flush=True)
|
||||
|
||||
# Calculate sum from entries
|
||||
# === CANDIDATE SELECTION ===
|
||||
# Select best candidate using TOTAL TVA BON as authoritative reference
|
||||
if all_candidates:
|
||||
best_entries, best_sum = self._select_best_tva_candidate(all_candidates, tva_bon_total)
|
||||
if best_entries:
|
||||
tva_entries = best_entries
|
||||
entries_sum = best_sum
|
||||
|
||||
# Calculate sum from entries (if not set by candidate selection)
|
||||
entries_sum = None
|
||||
if tva_entries:
|
||||
entries_sum = sum(entry['amount'] for entry in tva_entries)
|
||||
|
||||
600
scripts/generate_store_profile.py
Executable file
@@ -0,0 +1,600 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Store Profile Generator Script
|
||||
|
||||
Analyzes PDF receipts from a store and generates a Python profile class
|
||||
for the OCR extraction system.
|
||||
|
||||
Usage:
|
||||
python scripts/generate_store_profile.py \
|
||||
--name "Magazin Exemplu" \
|
||||
--cui "12345678" \
|
||||
--receipts "docs/data-entry/MagazinExemplu*.pdf" \
|
||||
--output "backend/modules/data_entry/services/ocr/profiles/magazin_exemplu.py"
|
||||
|
||||
Features:
|
||||
- Submits PDFs to OCR API
|
||||
- Analyzes extracted text for patterns (TVA, total, date, payment)
|
||||
- Generates a BaseStoreProfile subclass with detected patterns
|
||||
- Supports hot-reload via ProfileRegistry
|
||||
|
||||
Requirements:
|
||||
- Backend server running on localhost:8000
|
||||
- JWT authentication
|
||||
- python-jose, requests packages
|
||||
"""

import argparse
import glob
import json
import os
import re
import sys
import time
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Dict, List, Optional, Tuple

try:
    import requests
    from jose import jwt
except ImportError:
    print("Error: Required packages not installed.")
    print("Run: pip install python-jose requests")
    sys.exit(1)


# Configuration
API_BASE = os.getenv("API_BASE", "http://localhost:8000")
# Development fallback only; set JWT_SECRET_KEY in the environment for anything else.
JWT_SECRET = os.getenv("JWT_SECRET_KEY", "GENERATE_NEW_SECRET_FOR_PRODUCTION3334!")


def create_jwt_token() -> str:
    """Create a test JWT token for API authentication."""
    payload = {
        "username": "PROFILE_GENERATOR",
        "user_id": 1,
        "companies": ["604"],
        "permissions": ["read", "write"],
        "exp": datetime.now(timezone.utc) + timedelta(hours=1),
        "iat": datetime.now(timezone.utc),
        "type": "access"
    }
    return jwt.encode(payload, JWT_SECRET, algorithm="HS256")


def submit_ocr(pdf_path: str, token: str, api_base: str = API_BASE, timeout: int = 120) -> Optional[Dict]:
    """
    Submit a PDF to OCR API and wait for result.

    Args:
        pdf_path: Path to PDF file
        token: JWT authentication token
        api_base: API base URL
        timeout: Max seconds to wait for completion

    Returns:
        Extraction result dict or None on failure
    """
    headers = {"Authorization": f"Bearer {token}"}
    filename = os.path.basename(pdf_path)

    print(f" Submitting: {filename}...", end=" ", flush=True)

    try:
        with open(pdf_path, "rb") as f:
            files = {"file": (filename, f, "application/pdf")}
            response = requests.post(
                f"{api_base}/api/data-entry/ocr/extract?engine=doctr_plus",
                files=files,
                headers=headers,
                timeout=30
            )

        if response.status_code != 200:
            print(f"FAILED (HTTP {response.status_code})")
            return None

        job_data = response.json()
        job_id = job_data.get("job_id")

        if not job_id:
            print("FAILED (no job_id)")
            return None

        # Poll for completion
        start_time = time.time()
        while time.time() - start_time < timeout:
            poll_response = requests.get(
                f"{api_base}/api/data-entry/ocr/jobs/{job_id}/wait?timeout=30",
                headers=headers,
                timeout=35
            )

            if poll_response.status_code == 200:
                job_result = poll_response.json()
                status = job_result.get("status")

                if status == "completed":
                    elapsed = time.time() - start_time
                    print(f"OK ({elapsed:.1f}s)")
                    return job_result.get("result", {})
                elif status == "error":
                    print(f"ERROR: {job_result.get('error', 'Unknown')}")
                    return None

            time.sleep(2)

        print("TIMEOUT")
        return None

    except Exception as e:
        print(f"EXCEPTION: {e}")
        return None


def analyze_tva_patterns(results: List[Dict]) -> Dict:
    """
    Analyze TVA patterns from multiple extraction results.

    Returns:
        Dict with detected patterns and statistics
    """
    tva_entries = []
    raw_texts = []

    for r in results:
        if r.get("tva_entries"):
            tva_entries.extend(r["tva_entries"])
        if r.get("raw_text"):
            raw_texts.append(r["raw_text"])

    # Analyze TVA code patterns (A, B, C, etc.)
    codes = Counter(e.get("code") for e in tva_entries if e.get("code"))

    # Analyze TVA percentage patterns
    percents = Counter(e.get("percent") for e in tva_entries if e.get("percent"))

    # Detect TVA format from raw text
    tva_formats = defaultdict(int)
    for text in raw_texts:
        text_upper = text.upper()

        # Standard format: "TVA 19% 10.50" or "TVA: 19% 10.50"
        if re.search(r'TVA\s*:?\s*\d{1,2}%', text_upper):
            tva_formats["standard"] += 1

        # Lidl format: "TVA A 21% 7.71"
        if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
            tva_formats["lidl_multi_rate"] += 1

        # Table format: "BAZA TVA | % TVA | VALOARE TVA"
        if re.search(r'BAZA\s+TVA', text_upper):
            tva_formats["table"] += 1

        # No TVA (neplatitor)
        if re.search(r'NEPLATITOR|NON.?TVA', text_upper):
            tva_formats["non_vat"] += 1

    return {
        "codes": dict(codes),
        "percents": dict(percents),
        "formats": dict(tva_formats),
        "has_multi_rate": len(codes) > 1,
        "is_non_vat": tva_formats.get("non_vat", 0) > 0,
        "dominant_format": max(tva_formats, key=tva_formats.get) if tva_formats else "standard"
    }


def analyze_total_patterns(results: List[Dict]) -> Dict:
    """Analyze TOTAL patterns from extraction results."""
    totals = []
    raw_texts = []

    for r in results:
        if r.get("amount"):
            totals.append(float(r["amount"]))
        if r.get("raw_text"):
            raw_texts.append(r["raw_text"])

    total_formats = defaultdict(int)
    for text in raw_texts:
        text_upper = text.upper()

        if re.search(r'TOTAL\s*:?\s*[\d.,]+', text_upper):
            total_formats["TOTAL:"] += 1
        if re.search(r'TOTAL\s+DE\s+PLAT', text_upper):
            total_formats["TOTAL DE PLATA"] += 1
        if re.search(r'SUMA\s+TOTAL', text_upper):
            total_formats["SUMA TOTALA"] += 1
        if re.search(r'GRAND\s*TOTAL', text_upper):
            total_formats["GRAND TOTAL"] += 1

    return {
        "count": len(totals),
        "formats": dict(total_formats),
        "dominant_format": max(total_formats, key=total_formats.get) if total_formats else "TOTAL"
    }


def analyze_date_patterns(results: List[Dict]) -> Dict:
    """Analyze date patterns from extraction results."""
    dates = []
    raw_texts = []

    for r in results:
        if r.get("receipt_date"):
            dates.append(r["receipt_date"])
        if r.get("raw_text"):
            raw_texts.append(r["raw_text"])

    date_formats = defaultdict(int)
    for text in raw_texts:
        # DD.MM.YYYY
        if re.search(r'\d{2}\.\d{2}\.\d{4}', text):
            date_formats["DD.MM.YYYY"] += 1
        # YYYY.MM.DD (OMV/SOCAR style)
        if re.search(r'\d{4}\.\d{2}\.\d{2}', text):
            date_formats["YYYY.MM.DD"] += 1
        # DD-MM-YYYY
        if re.search(r'\d{2}-\d{2}-\d{4}', text):
            date_formats["DD-MM-YYYY"] += 1
        # DD/MM/YYYY
        if re.search(r'\d{2}/\d{2}/\d{4}', text):
            date_formats["DD/MM/YYYY"] += 1

    return {
        "extracted_dates": dates,
        "formats": dict(date_formats),
        "dominant_format": max(date_formats, key=date_formats.get) if date_formats else "DD.MM.YYYY"
    }


def analyze_payment_patterns(results: List[Dict]) -> Dict:
    """Analyze payment method patterns."""
    payment_counts = defaultdict(int)

    for r in results:
        methods = r.get("payment_methods", [])
        for m in methods:
            method_type = m.get("method", "UNKNOWN")
            payment_counts[method_type] += 1

    return {
        "methods": dict(payment_counts),
        "has_mixed_payments": len(payment_counts) > 1
    }


def analyze_client_patterns(results: List[Dict]) -> Dict:
    """Analyze client (B2B) patterns."""
    has_client_cui = 0
    has_client_name = 0

    for r in results:
        if r.get("client_cui"):
            has_client_cui += 1
        if r.get("client_name"):
            has_client_name += 1

    return {
        "has_client_cui": has_client_cui > 0,
        "has_client_name": has_client_name > 0,
        "b2b_ratio": has_client_cui / len(results) if results else 0
    }


def generate_profile_code(
    store_name: str,
    cui: str,
    tva_analysis: Dict,
    total_analysis: Dict,
    date_analysis: Dict,
    payment_analysis: Dict,
    client_analysis: Dict
) -> str:
    """
    Generate Python profile class code.

    Args:
        store_name: Human-readable store name
        cui: CUI number (without RO prefix)
        *_analysis: Analysis results from pattern detection

    Returns:
        Python source code for the profile class
    """
    # Generate class name from store name
    class_name = "".join(
        word.capitalize()
        for word in re.sub(r'[^a-zA-Z0-9\s]', '', store_name).split()
    ) + "Profile"

    # Generate module name
    module_name = re.sub(r'[^a-z0-9]', '_', store_name.lower()).strip('_')

    # Determine profile characteristics
    is_non_vat = tva_analysis.get("is_non_vat", False)
    has_multi_rate = tva_analysis.get("has_multi_rate", False)
    has_client_cui = client_analysis.get("has_client_cui", False)
    uses_yyyy_mm_dd = date_analysis.get("dominant_format") == "YYYY.MM.DD"

    # Generate OCR name patterns
    name_words = store_name.upper().split()
    primary_word = name_words[0] if name_words else store_name.upper()
    name_patterns = [
        primary_word,
        store_name.upper().replace(".", "").replace(",", ""),
    ]
    # Add OCR error variants (first occurrence of each look-alike character)
    ocr_variants = {
        'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3'
    }
    for char, replacement in ocr_variants.items():
        if char in primary_word:
            name_patterns.append(primary_word.replace(char, replacement, 1))

    name_patterns = list(dict.fromkeys(name_patterns))[:4]  # Unique, max 4

    # Build the code
    code_lines = [
        '"""',
        f'{store_name} store profile for OCR extraction.',
        '',
        'Auto-generated by generate_store_profile.py',
        f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}',
        '"""',
        '',
        'import re',
        'from decimal import Decimal, InvalidOperation',
        'from typing import List, Dict, Any',
        '',
        'from .base import BaseStoreProfile',
        'from . import ProfileRegistry',
        '',
        '',
        '@ProfileRegistry.register',
        f'class {class_name}(BaseStoreProfile):',
        '    """',
        f'    {store_name} - OCR extraction profile.',
        '',
    ]

    # Add characteristics to docstring
    characteristics = []
    if is_non_vat:
        characteristics.append("Non-VAT payer (neplatitor TVA)")
    if has_multi_rate:
        characteristics.append("Multi-rate TVA")
    if has_client_cui:
        characteristics.append("B2B receipts with client CUI")
    if uses_yyyy_mm_dd:
        characteristics.append("Date format: YYYY.MM.DD")

    if characteristics:
        code_lines.append('    Key characteristics:')
        for c in characteristics:
            code_lines.append(f'    - {c}')
        code_lines.append('')

    code_lines.extend([
        '    """',
        '',
        f'    CUI_LIST = ["{cui}"]',
        f'    NAME_PATTERNS = {name_patterns}',
        # repr() keeps the generated literal valid even if the name contains quotes
        f'    STORE_NAME = {store_name!r}',
        '',
    ])

    # Add date patterns override for YYYY.MM.DD format
    if uses_yyyy_mm_dd:
        code_lines.extend([
            '    # Override date patterns for YYYY.MM.DD format',
            '    DATE_PATTERNS_OCR_SPACES = [',
            '        r\'(\\d{4})[.,]\\s*(\\d{2})[.,]\\s*(\\d{2})\',  # YYYY. MM. DD with spaces',
            '        r\'(\\d{4})[.,](\\d{2})[.,](\\d{2})\',  # YYYY.MM.DD',
            '    ]',
            '',
        ])

    # Add TVA extraction method for multi-rate or non-VAT
    if is_non_vat:
        code_lines.extend([
            '    def extract_tva_entries(self, text: str) -> List[dict]:',
            '        """Non-VAT payer - returns empty list."""',
            '        return []',
            '',
        ])
    elif has_multi_rate and tva_analysis.get("dominant_format") == "lidl_multi_rate":
        code_lines.extend([
            '    # Store-specific TVA patterns',
            '    TVA_PATTERNS = [',
            '        r\'T[VU][AR]\\s+([A-D])\\s+(\\d{1,2})[.,]?\\d{0,2}\\s*%\\s+([\\d.,]+)\',',
            '    ]',
            '',
            '    def extract_tva_entries(self, text: str) -> List[dict]:',
            '        """Extract multi-rate TVA entries."""',
            '        entries = []',
            '        seen = set()',
            '',
            '        for pattern in self.TVA_PATTERNS:',
            '            for match in re.finditer(pattern, text, re.IGNORECASE):',
            '                try:',
            '                    code = match.group(1).upper()',
            '                    percent = int(match.group(2))',
            '                    amount = self._parse_decimal(match.group(3))',
            '',
            '                    if amount and amount > 0:',
            '                        entry_key = (code, percent)',
            '                        if entry_key not in seen:',
            '                            entries.append({',
            '                                \'code\': code,',
            '                                \'percent\': percent,',
            '                                \'amount\': amount',
            '                            })',
            '                            seen.add(entry_key)',
            '                except (ValueError, InvalidOperation):',
            '                    continue',
            '',
            '        return entries',
            '',
        ])

    # Add validation hints method
    code_lines.extend([
        '    def get_validation_hints(self) -> Dict[str, Any]:',
        f'        """Return {store_name}-specific validation hints."""',
        '        return {',
        f'            "has_multi_rate_tva": {has_multi_rate},',
        '            "card_equals_total": True,',
        f'            "has_client_cui": {has_client_cui},',
        '            "has_efactura": False,',
        f'            "is_non_vat_payer": {is_non_vat},',
        '        }',
    ])

    return '\n'.join(code_lines) + '\n'


def main():
    parser = argparse.ArgumentParser(
        description="Generate store profile from PDF receipts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Generate profile from a single PDF
  python scripts/generate_store_profile.py \\
      --name "Magazin Nou" --cui "12345678" \\
      --receipts "docs/data-entry/magazin_nou.pdf"

  # Generate profile from multiple PDFs (glob pattern)
  python scripts/generate_store_profile.py \\
      --name "Carrefour" --cui "2475489" \\
      --receipts "docs/data-entry/Carrefour*.pdf" \\
      --output backend/modules/data_entry/services/ocr/profiles/carrefour.py

  # Dry run (analyze only, don't write file)
  python scripts/generate_store_profile.py \\
      --name "Test Store" --cui "11111111" \\
      --receipts "docs/data-entry/test*.pdf" \\
      --dry-run
"""
    )

    parser.add_argument("--name", required=True, help="Store name (e.g., 'LIDL DISCOUNT S.R.L.')")
    parser.add_argument("--cui", required=True, help="CUI number without RO prefix")
    parser.add_argument("--receipts", required=True, help="PDF file path or glob pattern")
    parser.add_argument("--output", help="Output file path (default: auto-generated)")
    parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't write file")
    parser.add_argument("--api-base", default=API_BASE, help=f"API base URL (default: {API_BASE})")

    args = parser.parse_args()

    # Update API base if provided
    api_base = args.api_base

    # Validate CUI format (tolerate an "RO"/"ro" prefix and spaces)
    cui = args.cui.strip().upper().replace("RO", "").replace(" ", "")
    if not cui.isdigit() or len(cui) < 6 or len(cui) > 10:
        print(f"Error: Invalid CUI format: {args.cui}")
        sys.exit(1)

    # Find PDF files
    pdf_files = glob.glob(args.receipts)
    if not pdf_files:
        print(f"Error: No PDF files found matching: {args.receipts}")
        sys.exit(1)

    print(f"\n{'='*60}")
    print("Store Profile Generator")
    print(f"{'='*60}")
    print(f"Store: {args.name}")
    print(f"CUI: {cui}")
    print(f"PDFs: {len(pdf_files)} files")
    print(f"{'='*60}\n")

    # Generate JWT token
    token = create_jwt_token()

    # Submit PDFs to OCR
    print("Step 1: Submitting PDFs to OCR API...")
    results = []
    for pdf_path in pdf_files:
        result = submit_ocr(pdf_path, token, api_base=api_base)
        if result:
            results.append(result)

    if not results:
        print("\nError: No successful extractions. Check if backend is running.")
        sys.exit(1)

    print(f"\nSuccessfully extracted: {len(results)}/{len(pdf_files)} PDFs")

    # Analyze patterns
    print("\nStep 2: Analyzing patterns...")
    tva_analysis = analyze_tva_patterns(results)
    total_analysis = analyze_total_patterns(results)
    date_analysis = analyze_date_patterns(results)
    payment_analysis = analyze_payment_patterns(results)
    client_analysis = analyze_client_patterns(results)

    print(f" TVA: {tva_analysis['dominant_format']} format, multi-rate={tva_analysis['has_multi_rate']}")
    print(f" Date: {date_analysis['dominant_format']} format")
    print(f" Payments: {list(payment_analysis['methods'].keys())}")
    print(f" B2B: {client_analysis['has_client_cui']}")

    # Generate profile code
    print("\nStep 3: Generating profile code...")
    code = generate_profile_code(
        store_name=args.name,
        cui=cui,
        tva_analysis=tva_analysis,
        total_analysis=total_analysis,
        date_analysis=date_analysis,
        payment_analysis=payment_analysis,
        client_analysis=client_analysis
    )

    # Determine output path
    if args.output:
        output_path = args.output
    else:
        module_name = re.sub(r'[^a-z0-9]', '_', args.name.lower()).strip('_')
        output_path = f"backend/modules/data_entry/services/ocr/profiles/{module_name}.py"

    if args.dry_run:
        print(f"\n[DRY RUN] Would write to: {output_path}")
        print(f"\n{'='*60}")
        print("Generated code:")
        print(f"{'='*60}")
        print(code)
    else:
        # Write file; only create the parent directory when the path has one,
        # since os.makedirs('') raises for a bare filename
        output_dir = os.path.dirname(output_path)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(code)
        print(f" Written to: {output_path}")

        # Verify syntax
        import py_compile
        try:
            py_compile.compile(output_path, doraise=True)
            print(" Syntax check: OK")
        except py_compile.PyCompileError as e:
            print(f" Syntax check: FAILED - {e}")

    print(f"\n{'='*60}")
    print("Profile generation complete!")
    print(f"{'='*60}")

    if not args.dry_run:
        print("\nNext steps:")
        print(f"1. Review the generated code: {output_path}")
        print("2. Customize patterns if needed")
        print(f"3. Hot-reload profiles: curl -X POST {api_base}/api/data-entry/ocr/profiles/reload")
        print("4. Test with a sample receipt")


if __name__ == "__main__":
    main()
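
The multi-rate template that `generate_profile_code` emits can be tried outside the backend. The following is a minimal sketch, not the project's implementation: the regex is copied from the generator, while `parse_amount` is a hypothetical stand-in for `BaseStoreProfile._parse_decimal` (whose code is not part of this file) and `extract_multi_rate` is an illustrative name. It assumes amounts use a single decimal separator.

```python
import re
from decimal import Decimal, InvalidOperation

# Pattern copied from the generator's multi-rate TVA template.
TVA_PATTERN = r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'


def parse_amount(raw):
    """Hypothetical helper: accept '7,71' or '7.71'; return None if unparseable."""
    try:
        return Decimal(raw.replace(',', '.'))
    except InvalidOperation:
        return None


def extract_multi_rate(text):
    """Collect one entry per (code, percent) pair, as the generated method does."""
    entries, seen = [], set()
    for match in re.finditer(TVA_PATTERN, text, re.IGNORECASE):
        code = match.group(1).upper()
        percent = int(match.group(2))
        amount = parse_amount(match.group(3))
        if amount and amount > 0 and (code, percent) not in seen:
            entries.append({'code': code, 'percent': percent, 'amount': amount})
            seen.add((code, percent))
    return entries


# The repeated "TVA A" line is dropped by the (code, percent) deduplication.
sample = "TVA A 21% 7.71\nTVA B 11% 2.30\nTVA A 21% 7.71"
entries = extract_multi_rate(sample)
print(entries)
```

Because entries are deduplicated on `(code, percent)`, a receipt that prints the same TVA line twice (for example in a summary block) yields a single entry; two distinct amounts under the same code and rate would also collapse to the first one, which may or may not be the desired behavior for a given store.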