feat(ocr): Add modular store profiles with hot-reload support
## Store Profiles System
- Add ProfileRegistry for CUI-based profile lookup
- Add BaseStoreProfile with generic extraction patterns
- Implement hot-reload via POST /api/data-entry/ocr/profiles/reload
## 12 Store Profiles
- LIDL: Multi-rate TVA (A, B, C, D codes)
- OMV, SOCAR: B2B with client CUI, YYYY.MM.DD dates
- BRICK, DEDEMAN: Standard TVA, e-factura support
- KINETERRA, BEST PRINT: Non-VAT payers (returns [])
- STEPOUT MARKET: TVA 5% (books/reduced rate)
- UNLIMITED KEYS: NUMERAR payment detection
- GAMA INK, ELECTROBERING, PICTUS VELUM: Standard TVA
## Flexible TVA Patterns
- All patterns use (\d{1,2})% to accept any rate
- Supports historical (19%, 9%, 5%) and current (21%, 11%)
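
A minimal sketch of what such a rate-agnostic pattern might look like. The regex and the `find_tva_rates` helper are illustrative assumptions built around the `(\d{1,2})%` fragment, not the patterns shipped in the profiles:

```python
import re
from typing import List

# Hypothetical pattern: "TVA", an optional rate code (A-D), then any 1-2 digit rate.
TVA_RATE_RE = re.compile(r"TVA\s+(?:[A-D]\s+)?(\d{1,2})%")


def find_tva_rates(text: str) -> List[int]:
    """Return every TVA percentage found in the text, in order of appearance."""
    return [int(rate) for rate in TVA_RATE_RE.findall(text.upper())]
```

Because the rate is captured rather than hard-coded, a receipt printed at 19% and one printed at 21% go through the same pattern.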
## Payment Methods Fix
- Fixed base.py to support multiple payments of same type
- Changed deduplication from method-only to (method, amount) tuple
- Returns separate entries for split payments
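
The (method, amount) deduplication can be sketched as follows. `dedupe_payments` and its input shape are hypothetical; the real logic lives in base.py and may differ:

```python
from decimal import Decimal
from typing import List, Tuple


def dedupe_payments(payments: List[Tuple[str, Decimal]]) -> List[dict]:
    """Drop repeated (method, amount) pairs while keeping split payments."""
    seen = set()
    unique = []
    for method, amount in payments:
        key = (method, amount)  # keying on method alone would merge split payments
        if key in seen:
            continue
        seen.add(key)
        unique.append({"method": method, "amount": amount})
    return unique
```

Under this scheme two CARD payments with different amounts survive as separate entries, while the same line matched twice by overlapping patterns is collapsed.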
## Tools
- Add generate_store_profile.py for automatic profile generation
- Analyzes PDFs via OCR API and detects patterns
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
106
.claude/rules/claude-learn-backend.md
Normal file
@@ -0,0 +1,106 @@
|
|||||||
|
# Claude Learn: Backend
|
||||||
|
|
||||||
|
**Domain**: backend
|
||||||
|
**Last updated**: 2026-01-06
|
||||||
|
**Sessions recorded**: 1
|
||||||
|
|
||||||
|
Knowledge about FastAPI, Python services, Oracle DB, and backend architecture.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns
|
||||||
|
|
||||||
|
### ProfileRegistry with Hot-Reload for Store Profiles
|
||||||
|
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
|
||||||
|
**Description**: Registration system for OCR profiles using the `@ProfileRegistry.register` decorator, with hot-reload via `importlib.reload()`. Allows profiles to be added or modified without a server restart.
|
||||||
|
|
||||||
|
**Example** (`backend/modules/data_entry/services/ocr/profiles/__init__.py`):
|
||||||
|
```python
class ProfileRegistry:
    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
    _instances: Dict[str, "BaseStoreProfile"] = {}

    @classmethod
    def register(cls, profile_class):
        """Decorator to register a store profile class."""
        for cui in profile_class.CUI_LIST:
            cls._profiles[cls._normalize_cui(cui)] = profile_class
        return profile_class

    @classmethod
    def reload_all(cls):
        """Hot-reload all profile modules via importlib.reload()."""
        cls._instances.clear()
        for module_name in cls._get_profile_module_names():
            importlib.reload(sys.modules[f"backend...profiles.{module_name}"])
```
|
||||||
|
|
||||||
|
**Usage**:
|
||||||
|
```python
@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
    CUI_LIST = ["22891860"]
    STORE_NAME = "LIDL DISCOUNT S.R.L."

# Lookup
profile = ProfileRegistry.get_profile("22891860")

# Hot-reload (HTTP endpoint, not Python):
#   POST /api/data-entry/ocr/profiles/reload
```
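
For reference, a self-contained sketch of the decorator-based registry, runnable outside the backend. The stub `BaseStoreProfile`, the body of `_normalize_cui`, and the instance caching in `get_profile` are assumptions made for illustration; only the method names come from the example above:

```python
from typing import Dict, List, Optional, Type


class BaseStoreProfile:
    """Stub standing in for the real base class."""
    CUI_LIST: List[str] = []
    STORE_NAME: str = ""


class ProfileRegistry:
    _profiles: Dict[str, Type[BaseStoreProfile]] = {}
    _instances: Dict[str, BaseStoreProfile] = {}

    @staticmethod
    def _normalize_cui(cui: str) -> str:
        # Assumed behaviour: drop whitespace and an optional "RO" prefix.
        cleaned = "".join(cui.split()).upper()
        return cleaned[2:] if cleaned.startswith("RO") else cleaned

    @classmethod
    def register(cls, profile_class):
        """Class decorator: map every CUI of the profile to its class."""
        for cui in profile_class.CUI_LIST:
            cls._profiles[cls._normalize_cui(cui)] = profile_class
        return profile_class

    @classmethod
    def get_profile(cls, cui: str) -> Optional[BaseStoreProfile]:
        """Return a cached profile instance for the CUI, or None."""
        key = cls._normalize_cui(cui)
        profile_class = cls._profiles.get(key)
        if profile_class is None:
            return None
        if key not in cls._instances:
            cls._instances[key] = profile_class()
        return cls._instances[key]


@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
    CUI_LIST = ["22891860"]
    STORE_NAME = "LIDL DISCOUNT S.R.L."
```

Registration happens as a side effect of importing the module, which is why re-importing a module with `importlib.reload()` is enough to pick up an edited profile class.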
|
||||||
|
|
||||||
|
**Tags**: registry-pattern, hot-reload, decorator, ocr, singleton
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Script for Generating Python Code from PDF Analysis
|
||||||
|
**Discovered**: 2026-01-06 (feature: ocr-store-profiles)
|
||||||
|
**Description**: Script that analyzes PDFs via the OCR API, detects patterns (TVA format, date format, payment) and automatically generates Python code for new profiles. Includes JWT auth, async polling, and syntax verification.
|
||||||
|
|
||||||
|
**Example** (`scripts/generate_store_profile.py`):
|
||||||
|
```python
def analyze_tva_patterns(results: List[Dict]) -> Dict:
    """Detect the dominant TVA format in the OCR results."""
    tva_formats = defaultdict(int)
    # raw_texts: the OCR text of each entry in `results` (extraction elided in this excerpt)
    for text in raw_texts:
        text_upper = text.upper()
        if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
            tva_formats["lidl_multi_rate"] += 1
        if re.search(r'BAZA\s+TVA', text_upper):
            tva_formats["table"] += 1
    return {"dominant_format": max(tva_formats, key=tva_formats.get)}


def generate_profile_code(store_name, cui, tva_analysis, ...):
    """Generate Python code for the profile class."""
    # Template-based generation with OCR error variants
```
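
The excerpt above does not show how the OCR text is pulled out of `results`, and `max()` would raise on an empty tally. A runnable sketch of the same counting idea follows, assuming (hypothetically) that each result dict carries its text under a `raw_text` key:

```python
import re
from collections import Counter
from typing import Dict, List, Optional


def analyze_tva_patterns(results: List[Dict]) -> Dict[str, Optional[str]]:
    """Count which TVA layout appears most often across OCR results."""
    tva_formats: Counter = Counter()
    for result in results:
        # "raw_text" is an assumed key name, not confirmed by the script.
        text_upper = str(result.get("raw_text", "")).upper()
        if re.search(r"TVA\s+[A-D]\s+\d{1,2}", text_upper):
            tva_formats["lidl_multi_rate"] += 1
        if re.search(r"BAZA\s+TVA", text_upper):
            tva_formats["table"] += 1
    if not tva_formats:
        # No known layout detected in any receipt.
        return {"dominant_format": None}
    return {"dominant_format": tva_formats.most_common(1)[0][0]}
```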
|
||||||
|
|
||||||
|
**Usage**:
|
||||||
|
```bash
# Dry-run for preview
python scripts/generate_store_profile.py \
    --name "Magazin Nou" --cui "12345678" \
    --receipts "docs/data-entry/MagazinNou*.pdf" --dry-run

# Generate and save
python scripts/generate_store_profile.py \
    --name "Magazin Nou" --cui "12345678" \
    --receipts "docs/data-entry/MagazinNou*.pdf" \
    --output backend/.../profiles/magazin_nou.py
```
|
||||||
|
|
||||||
|
**Tags**: code-generation, ocr, automation, cli-tool
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gotchas
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
- **Total Patterns**: 2
|
||||||
|
- **Total Gotchas**: 0
|
||||||
|
- **Last Session**: 2026-01-06
|
||||||
|
- **Sessions Recorded**: 1
|
||||||
28
.claude/rules/claude-learn-database.md
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
# Claude Learn: Database
|
||||||
|
|
||||||
|
**Domain**: database
|
||||||
|
**Last updated**: -
|
||||||
|
**Sessions recorded**: 0
|
||||||
|
|
||||||
|
Knowledge about Oracle DB, SQLite, SQLModel, migrations, and data modeling.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gotchas
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
- **Total Patterns**: 0
|
||||||
|
- **Total Gotchas**: 0
|
||||||
|
- **Last Session**: -
|
||||||
|
- **Sessions Recorded**: 0
|
||||||
55
.claude/rules/claude-learn-deployment.md
Normal file
@@ -0,0 +1,55 @@
|
|||||||
|
# Claude Learn: Deployment
|
||||||
|
|
||||||
|
**Domain**: deployment
|
||||||
|
**Last updated**: 2026-01-06
|
||||||
|
**Sessions recorded**: 1
|
||||||
|
|
||||||
|
Knowledge about IIS, Docker, deployment scripts, and infrastructure.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns
|
||||||
|
|
||||||
|
### IIS URL Rewrite Rules for SPA with Multiple API Backends
|
||||||
|
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||||
|
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
|
||||||
|
|
||||||
|
**Example** (`public/web.config:5-28`):
|
||||||
|
```xml
<rewrite>
  <rules>
    <rule name="Proxy Reports API" stopProcessing="true">
      <match url="^api/reports/(.*)" />
      <action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
    </rule>
    <rule name="Proxy Data Entry API" stopProcessing="true">
      <match url="^api/data-entry/(.*)" />
      <action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
    </rule>
    <rule name="SPA Fallback" stopProcessing="true">
      <match url=".*" />
      <conditions logicalGrouping="MatchAll">
        <add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
      </conditions>
      <action type="Rewrite" url="/index.html" />
    </rule>
  </rules>
</rewrite>
```
|
||||||
|
|
||||||
|
**Tags**: iis, deployment, spa, microservices, proxy
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gotchas
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
- **Total Patterns**: 1
|
||||||
|
- **Total Gotchas**: 0
|
||||||
|
- **Last Session**: 2026-01-06
|
||||||
|
- **Sessions Recorded**: 1
|
||||||
83
.claude/rules/claude-learn-domains.md
Normal file
@@ -0,0 +1,83 @@
|
|||||||
|
# Claude Learn Domains Configuration
|
||||||
|
|
||||||
|
**Last updated**: 2026-01-06
|
||||||
|
|
||||||
|
This file defines available knowledge domains and their file path patterns.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Domains
|
||||||
|
|
||||||
|
### frontend
|
||||||
|
**File**: `claude-learn-frontend.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `src/**/*.vue`
|
||||||
|
- `src/**/*.js`
|
||||||
|
- `src/**/*.ts`
|
||||||
|
- `src/**/*.css`
|
||||||
|
- `vite.config.*`
|
||||||
|
- `package.json`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### backend
|
||||||
|
**File**: `claude-learn-backend.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `backend/**/*.py`
|
||||||
|
- `backend/modules/**/*`
|
||||||
|
- `requirements.txt`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### database
|
||||||
|
**File**: `claude-learn-database.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `**/*.sql`
|
||||||
|
- `**/models.py`
|
||||||
|
- `**/schemas.py`
|
||||||
|
- `backend/**/db/**/*`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### testing
|
||||||
|
**File**: `claude-learn-testing.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `tests/**/*`
|
||||||
|
- `**/*.test.*`
|
||||||
|
- `**/*.spec.*`
|
||||||
|
- `pytest.ini`
|
||||||
|
- `vitest.config.*`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### deployment
|
||||||
|
**File**: `claude-learn-deployment.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `deployment/**/*`
|
||||||
|
- `public/web.config`
|
||||||
|
- `Dockerfile*`
|
||||||
|
- `docker-compose*.yml`
|
||||||
|
- `*.sh`
|
||||||
|
- `ansible/**/*`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### global
|
||||||
|
**File**: `claude-learn-global.md`
|
||||||
|
**Patterns**:
|
||||||
|
- `*` (catch-all for cross-cutting concerns)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
| Domain | Patterns | Gotchas | Last Updated |
|--------|----------|---------|--------------|
| frontend | 8 | 10 | 2026-01-06 |
| deployment | 1 | 0 | 2026-01-06 |
| global | 0 | 1 | 2026-01-06 |
| backend | 2 | 0 | 2026-01-06 |
| database | 0 | 0 | - |
| testing | 0 | 0 | - |
|
||||||
|
|
||||||
|
**Total**: 11 patterns, 11 gotchas across 4 domains
|
||||||
@@ -1,9 +1,10 @@
|
|||||||
# Learned Patterns & Gotchas
|
# Claude Learn: Frontend
|
||||||
|
|
||||||
**Last updated**: 2025-12-24
|
**Domain**: frontend
|
||||||
**Maintained**: Manually (add new patterns/gotchas as discovered)
|
**Last updated**: 2026-01-06
|
||||||
|
**Sessions recorded**: 3
|
||||||
|
|
||||||
This file contains insights learned during feature implementations. Claude Code auto-loads this file to prevent repeating past mistakes.
|
Knowledge about Vue.js, Vite, Pinia, CSS, and frontend architecture.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -130,37 +131,6 @@ resolve: {
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### IIS URL Rewrite Rules for SPA with Multiple API Backends
|
|
||||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
|
||||||
**Description**: Configure IIS web.config to proxy different API paths to different backend ports while serving SPA for all other routes. Enables single IIS site to route to multiple microservices.
|
|
||||||
|
|
||||||
**Example** (`public/web.config:5-28`):
|
|
||||||
```xml
|
|
||||||
<rewrite>
|
|
||||||
<rules>
|
|
||||||
<rule name="Proxy Reports API" stopProcessing="true">
|
|
||||||
<match url="^api/reports/(.*)" />
|
|
||||||
<action type="Rewrite" url="http://localhost:8001/api/{R:1}" />
|
|
||||||
</rule>
|
|
||||||
<rule name="Proxy Data Entry API" stopProcessing="true">
|
|
||||||
<match url="^api/data-entry/(.*)" />
|
|
||||||
<action type="Rewrite" url="http://localhost:8003/api/{R:1}" />
|
|
||||||
</rule>
|
|
||||||
<rule name="SPA Fallback" stopProcessing="true">
|
|
||||||
<match url=".*" />
|
|
||||||
<conditions logicalGrouping="MatchAll">
|
|
||||||
<add input="{REQUEST_FILENAME}" matchType="IsFile" negate="true" />
|
|
||||||
</conditions>
|
|
||||||
<action type="Rewrite" url="/index.html" />
|
|
||||||
</rule>
|
|
||||||
</rules>
|
|
||||||
</rewrite>
|
|
||||||
```
|
|
||||||
|
|
||||||
**Tags**: iis, deployment, spa, microservices, proxy
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Vue Watcher for Auto-Loading Dependent Data
|
### Vue Watcher for Auto-Loading Dependent Data
|
||||||
**Discovered**: 2025-12-24 (feature: unified-app-ux)
|
**Discovered**: 2025-12-24 (feature: unified-app-ux)
|
||||||
**Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention.
|
**Description**: Use Vue watch() to automatically trigger data loading when dependent selections change. Watch company selection changes to auto-load accounting periods, ensuring UI stays synchronized without manual intervention.
|
||||||
@@ -248,15 +218,6 @@ const getStorageKey = () => {
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### Sed Command Quote Mismatch in Bulk Find-Replace
|
|
||||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
|
||||||
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
|
|
||||||
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
|
|
||||||
|
|
||||||
**Tags**: sed, regex, scripting, find-replace, migration
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Circular Reference in API Wrapper
|
### Circular Reference in API Wrapper
|
||||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||||
**Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name.
|
**Problem**: receiptsStore.js failed to build with 'Identifier api has already been declared' because it imported api and then declared const api = { ... } wrapper object using the same name.
|
||||||
@@ -287,7 +248,7 @@ const getStorageKey = () => {
|
|||||||
### Vite Build Transform Count is Progress Indicator
|
### Vite Build Transform Count is Progress Indicator
|
||||||
**Discovered**: 2025-12-22 (feature: unified-app)
|
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||||
**Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration.
|
**Problem**: Hard to tell if build is making progress when fixing import issues. Each fix revealed new errors, causing frustration.
|
||||||
**Solution**: Watch the 'transforming... ✓ N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200→573→1490→1492 modules meant we were getting close to success. Use this as encouragement!
|
**Solution**: Watch the 'transforming... N modules transformed' count - it increases with each successful fix even if build ultimately fails. Going from 200->573->1490->1492 modules meant we were getting close to success. Use this as encouragement!
|
||||||
|
|
||||||
**Tags**: vite, build, debugging, progress-tracking, developer-experience
|
**Tags**: vite, build, debugging, progress-tracking, developer-experience
|
||||||
|
|
||||||
@@ -329,9 +290,9 @@ const getStorageKey = () => {
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Memory Statistics
|
## Statistics
|
||||||
|
|
||||||
- **Total Patterns**: 9
|
- **Total Patterns**: 8
|
||||||
- **Total Gotchas**: 11
|
- **Total Gotchas**: 10
|
||||||
- **Last Session**: 2025-12-24 (unified-app-ux)
|
- **Last Session**: 2026-01-06
|
||||||
- **Sessions Recorded**: 2
|
- **Sessions Recorded**: 3
|
||||||
33
.claude/rules/claude-learn-global.md
Normal file
@@ -0,0 +1,33 @@
|
|||||||
|
# Claude Learn: Global
|
||||||
|
|
||||||
|
**Domain**: global
|
||||||
|
**Last updated**: 2026-01-06
|
||||||
|
**Sessions recorded**: 1
|
||||||
|
|
||||||
|
Cross-cutting knowledge applicable to multiple domains (scripting, tooling, workflow).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gotchas
|
||||||
|
|
||||||
|
### Sed Command Quote Mismatch in Bulk Find-Replace
|
||||||
|
**Discovered**: 2025-12-22 (feature: unified-app)
|
||||||
|
**Problem**: Bulk sed commands using single quotes in pattern didn't match imports using double quotes, and vice versa. Commands like sed 's|from '@/stores/'|...' didn't replace from "@/stores/" lines.
|
||||||
|
**Solution**: Always use the quote style that matches the target files. For Vue/JS files with ESLint using double quotes, use double quotes in sed patterns. Better yet: use find -exec with separate sed for each file to handle both quote styles.
|
||||||
|
|
||||||
|
**Tags**: sed, regex, scripting, find-replace, migration
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
- **Total Patterns**: 0
|
||||||
|
- **Total Gotchas**: 1
|
||||||
|
- **Last Session**: 2026-01-06
|
||||||
|
- **Sessions Recorded**: 1
|
||||||
28
.claude/rules/claude-learn-testing.md
Normal file
@@ -0,0 +1,28 @@
|
|||||||
|
# Claude Learn: Testing
|
||||||
|
|
||||||
|
**Domain**: testing
|
||||||
|
**Last updated**: -
|
||||||
|
**Sessions recorded**: 0
|
||||||
|
|
||||||
|
Knowledge about pytest, Vitest, test patterns, and validation strategies.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Gotchas
|
||||||
|
|
||||||
|
_(None recorded yet)_
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Statistics
|
||||||
|
|
||||||
|
- **Total Patterns**: 0
|
||||||
|
- **Total Gotchas**: 0
|
||||||
|
- **Last Session**: -
|
||||||
|
- **Sessions Recorded**: 0
|
||||||
@@ -628,3 +628,86 @@ def _dict_to_extraction_data(data: dict) -> ExtractionData:
|
|||||||
validation_errors=data.get('validation_errors', []),
|
validation_errors=data.get('validation_errors', []),
|
||||||
inter_ocr_ratios=data.get('inter_ocr_ratios', {}),
|
inter_ocr_ratios=data.get('inter_ocr_ratios', {}),
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================================
# Store Profiles Management Endpoints
# ============================================================================

@router.post("/profiles/reload")
async def reload_store_profiles(
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    Hot-reload all store profiles.

    Reloads profile Python modules without server restart.
    Use after adding/modifying profile files.

    Returns:
        Dict with reloaded count and profile list
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    count = ProfileRegistry.reload_all()
    status = ProfileRegistry.get_reload_status()

    return {
        "success": True,
        "reloaded_modules": count,
        "profiles_count": status["profiles_count"],
        "registered_cuis": status["registered_cuis"],
        "last_reload": status["last_reload"],
    }
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/profiles")
async def list_store_profiles(
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    List all registered store profiles.

    Returns:
        Dict with profiles list and status
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    profiles = ProfileRegistry.list_profiles()
    status = ProfileRegistry.get_reload_status()

    return {
        "profiles": profiles,
        "count": len(profiles),
        "last_reload": status["last_reload"],
    }
|
||||||
|
|
||||||
|
|
||||||
|
@router.get("/profiles/{cui}")
async def get_store_profile(
    cui: str,
    current_user: CurrentUser = Depends(get_current_user)
) -> dict:
    """
    Get details for a specific store profile.

    Args:
        cui: Store CUI (with or without RO prefix)

    Returns:
        Profile details including validation hints

    Raises:
        404: If no profile exists for this CUI
    """
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    info = ProfileRegistry.get_profile_info(cui)

    if not info:
        raise HTTPException(
            status_code=404,
            detail=f"No profile registered for CUI: {cui}"
        )

    return info
|
||||||
|
|||||||
258
backend/modules/data_entry/services/ocr/profiles/README.md
Normal file
@@ -0,0 +1,258 @@
|
|||||||
|
# Store Profiles - OCR Extraction
|
||||||
|
|
||||||
|
A system of store-specific profiles for OCR extraction, with hot-reload.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quick Start: Add a new profile

```bash
# 1. Generate a profile from PDFs (dry-run for preview)
python scripts/generate_store_profile.py \
    --name "Magazin Nou SRL" \
    --cui "12345678" \
    --receipts "docs/data-entry/MagazinNou*.pdf" \
    --dry-run

# 2. Generate and save
python scripts/generate_store_profile.py \
    --name "Magazin Nou SRL" \
    --cui "12345678" \
    --receipts "docs/data-entry/MagazinNou*.pdf" \
    --output backend/modules/data_entry/services/ocr/profiles/magazin_nou.py

# 3. Hot-reload (no server restart); the endpoint requires an authenticated user
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
    -H "Authorization: Bearer <token>"

# 4. Verify
curl http://localhost:8000/api/data-entry/ocr/profiles \
    -H "Authorization: Bearer <token>"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Directory structure

```
profiles/
├── __init__.py          # ProfileRegistry + hot-reload (~390 lines)
├── base.py              # BaseStoreProfile + generic patterns (~410 lines)
├── lidl.py              # Multi-rate TVA (codes A, B, C, D)
├── omv.py               # B2B, YYYY.MM.DD dates
├── socar.py             # B2B, YYYY.MM.DD dates
├── brick.py             # Standard TVA
├── dedeman.py           # E-factura support
├── kineterra.py         # Non-VAT payer
├── gama_ink.py          # Standard TVA (toner/cartridges)
├── electrobering.py     # Standard TVA (electronics)
├── pictus_velum.py      # Standard TVA (stationery)
├── unlimited_keys.py    # Standard TVA, NUMERAR payment
├── best_print.py        # Non-VAT payer
├── stepout_market.py    # TVA 5% (books/bookstore)
└── README.md            # This file
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Existing profiles (12 profiles)

> **Note**: The TVA patterns are **flexible** and accept ANY rate (5%, 9%, 11%, 19%, 21%, etc.)
> so that both historical data and future legislative changes are handled.

| Store | CUI | File | Features |
|-------|-----|------|----------|
| LIDL DISCOUNT S.R.L. | 22891860 | `lidl.py` | Multi-rate TVA (codes A, B, C, D) |
| OMV PETROM MARKETING S.R.L. | 11201891 | `omv.py` | B2B (client CUI), YYYY.MM.DD dates |
| SOCAR PETROLEUM S.A. | 12546600 | `socar.py` | B2B (client CUI), YYYY.MM.DD dates |
| FIVE-HOLDING S.A. (BRICK) | 10562600 | `brick.py` | Standard TVA |
| DEDEMAN SRL | 2816464 | `dedeman.py` | E-factura support |
| KINETERRA CONCEPT SRL | 31180432 | `kineterra.py` | Non-VAT payer (returns `[]`) |
| GAMA INK SERVICE SRL | 17741882 | `gama_ink.py` | Standard TVA (toner, cartridges) |
| ELECTROBERING S.R.L. | 2744937 | `electrobering.py` | Standard TVA (electronics) |
| PICTUS VELUM SRL | 39634534 | `pictus_velum.py` | Standard TVA (stationery) |
| UNLIMITED KEYS S.R.L. | 18993187 | `unlimited_keys.py` | Standard TVA, **NUMERAR** payment |
| BEST PRINT TRADE ACTIV SRL | 45417955 | `best_print.py` | **Non-VAT payer** |
| STEPOUT MARKET SRL | 35532655 | `stepout_market.py` | TVA 5% (books, bookstore) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## API Endpoints
|
||||||
|
|
||||||
|
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/data-entry/ocr/profiles` | GET | List all profiles |
| `/api/data-entry/ocr/profiles/{cui}` | GET | Profile details (accepts RO prefix) |
| `/api/data-entry/ocr/profiles/reload` | POST | Hot-reload all profiles |
|
||||||
|
|
||||||
|
### API examples

```bash
# List profiles
curl http://localhost:8000/api/data-entry/ocr/profiles \
    -H "Authorization: Bearer <token>"

# Profile details (with or without RO prefix)
curl http://localhost:8000/api/data-entry/ocr/profiles/22891860 \
    -H "Authorization: Bearer <token>"
curl http://localhost:8000/api/data-entry/ocr/profiles/RO22891860 \
    -H "Authorization: Bearer <token>"

# Hot-reload after changes
curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload \
    -H "Authorization: Bearer <token>"

# Reload response:
{
  "success": true,
  "reloaded_modules": 12,
  "profiles_count": 12,
  "registered_cuis": ["22891860", "11201891", "12546600", "10562600", ...],
  "last_reload": "2026-01-06T22:37:05.000000"
}
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## How the system works

### Extraction flow

```
ReceiptExtractor.extract()
    │
    ├─► STEP 1: Extract vendor + CUI
    │       └─► _extract_vendor(), _extract_cui()
    │
    ├─► ProfileRegistry.get_profile(cui)
    │       └─► Returns the store-specific profile or None
    │
    ├─► STEP 2: Extraction with the profile (if one exists)
    │       ├─► profile.extract_total()
    │       ├─► profile.extract_date()
    │       ├─► profile.extract_receipt_number()
    │       ├─► profile.extract_tva_entries()
    │       ├─► profile.extract_payment_methods()
    │       └─► profile.extract_client_cui()
    │
    └─► STEP 3-4: Validation + post-processing
```
|
||||||
|
|
||||||
|
### Fallback
|
||||||
|
|
||||||
|
If no profile exists for the CUI, the generic logic in `ReceiptExtractor` is used.
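
The dispatch can be sketched roughly as below. Everything here is a hypothetical stand-in (the `PROFILES` dict, `StubProfile`, `generic_extract_total`); the real code goes through `ProfileRegistry.get_profile()` and `ReceiptExtractor`:

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple


class StubProfile:
    """Hypothetical stand-in for a registered store profile."""

    def extract_total(self, text: str) -> Decimal:
        return Decimal("42.00")  # placeholder for store-specific parsing


# Stand-in for the registry lookup table.
PROFILES: Dict[str, StubProfile] = {"22891860": StubProfile()}


def generic_extract_total(text: str) -> Decimal:
    return Decimal("0.00")  # placeholder for the generic extraction logic


def extract_total(text: str, cui: Optional[str]) -> Tuple[Decimal, str]:
    """Use the store profile when one is registered, else the generic path."""
    profile = PROFILES.get(cui) if cui else None
    if profile is not None:
        return profile.extract_total(text), "profile"
    return generic_extract_total(text), "generic"
```

The second tuple element only exists here to make the chosen path visible; it is not part of the documented API.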
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Structure of a profile

```python
from typing import Any, Dict, List

from .base import BaseStoreProfile
from . import ProfileRegistry

@ProfileRegistry.register
class MagazinNouProfile(BaseStoreProfile):
    """Docstring describing the store."""

    CUI_LIST = ["12345678"]  # May list several CUIs
    NAME_PATTERNS = ["MAGAZIN", "MAGAZIN NOU", "MAG4ZIN"]  # OCR variants
    STORE_NAME = "Magazin Nou SRL"

    # Override only what differs from the base class
    def extract_tva_entries(self, text: str) -> List[dict]:
        # Store-specific patterns
        ...

    def get_validation_hints(self) -> Dict[str, Any]:
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Patterns available in base.py

BaseStoreProfile includes generic, OCR-tolerant patterns:

| Pattern | Description |
|---------|-------------|
| `TOTAL_PATTERNS` | 8 variants for TOTAL (TOTAL:, TOTAL DE PLATA, etc.) |
| `DATE_PATTERNS` | 6 variants (DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY) |
| `DATE_PATTERNS_OCR_SPACES` | 4 variants with OCR-inserted spaces ("2025. 08. 14") |
| `NUMBER_PATTERNS` | 11 variants for the receipt number (NDS, BF, C3POS) |
| `PAYMENT_PATTERNS` | 8 variants for CARD/NUMERAR |
| `CLIENT_MARKERS` | 6 variants for the CLIENT section |
| `CLIENT_CUI_PATTERNS` | 7 variants for the client CUI |
|
||||||
|
|
||||||
|
### Methods implemented in the base class

- `extract_total(text)` → `Tuple[Decimal, float]`
- `extract_date(text)` → `Tuple[date, float]`
- `extract_receipt_number(text)` → `Tuple[str, float]`
- `extract_payment_methods(text)` → `List[dict]`
- `extract_client_cui(text)` → `Tuple[str, float]`
- `extract_client_name(text)` → `Tuple[str, float]`
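
A minimal sketch of how a pattern list and a `(value, float)` return could fit together. The two regexes, the reading of the float as a confidence score, and the `None` result on no match are all assumptions; the actual `TOTAL_PATTERNS` in base.py has 8 variants that are not reproduced here:

```python
import re
from decimal import Decimal
from typing import List, Optional, Pattern, Tuple

# Hypothetical patterns, most specific first, each paired with a confidence.
TOTAL_PATTERNS: List[Tuple[Pattern[str], float]] = [
    (re.compile(r"\bTOTAL\s+DE\s+PLATA\s*:?\s*(\d+[.,]\d{2})"), 0.95),
    (re.compile(r"\bTOTAL\s*:?\s*(\d+[.,]\d{2})"), 0.90),
]


def extract_total(text: str) -> Tuple[Optional[Decimal], float]:
    """Return (amount, confidence) for the first matching pattern."""
    upper = text.upper()
    for pattern, confidence in TOTAL_PATTERNS:
        match = pattern.search(upper)
        if match:
            # Receipts may use a decimal comma; Decimal needs a dot.
            return Decimal(match.group(1).replace(",", ".")), confidence
    return None, 0.0
```

Ordering the list from most to least specific lets a store profile override behaviour simply by prepending its own pattern.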
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## When do you need a custom profile?

| Situation | Example | What to override |
|-----------|---------|------------------|
| **Multi-rate TVA** | Lidl (TVA A, TVA B) | `extract_tva_entries()` |
| **Special date format** | OMV/Socar (YYYY.MM.DD) | `DATE_PATTERNS_OCR_SPACES` |
| **B2B receipts** | Gas stations (have a client CUI) | `extract_client_cui()` |
| **Non-VAT payer** | Kineterra | `extract_tva_entries()` returns `[]` |
| **E-factura** | Dedeman | `extract_efactura_reference()` |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Design decisions

1. **Manual hot-reload** - the `/profiles/reload` endpoint is called when files change
2. **Persistence in Python** - profiles live in Git and are version controlled
3. **Graceful fallback** - if no profile exists, the generic logic is used
4. **CUI normalization** - the "RO" prefix and whitespace are handled automatically
5. **TVA deduplication** - uses `seen = set()` to avoid duplicates

---

## Useful commands

```bash
# Check Python syntax for all profiles
for f in backend/modules/data_entry/services/ocr/profiles/*.py; do
  python3 -m py_compile "$f" && echo "✓ $(basename "$f")"
done

# List profiles
ls -la backend/modules/data_entry/services/ocr/profiles/

# Start the backend for testing
cd backend && source venv/bin/activate
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1

# Test OCR on a PDF
curl -X POST -F "file=@docs/data-entry/test.pdf" \
  -H "Authorization: Bearer <token>" \
  "http://localhost:8000/api/data-entry/ocr/extract?engine=doctr_plus"
```
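
After editing a profile file, the reload endpoint has to be called by hand. The sketch below only builds the request. The path comes from the commit message (`POST /api/data-entry/ocr/profiles/reload`) and the host and port from the `uvicorn` command above; whether the endpoint requires the same bearer token as the extract call is an assumption, and `build_reload_request` is a name invented for this example.

```python
import urllib.request

# Assumed values: host/port from the uvicorn command, path from the commit message.
BASE_URL = "http://localhost:8000"
RELOAD_PATH = "/api/data-entry/ocr/profiles/reload"


def build_reload_request(token: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that triggers a profile reload."""
    return urllib.request.Request(
        url=base_url + RELOAD_PATH,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is assumed
    )


request = build_reload_request("<token>")
print(request.get_method(), request.full_url)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode())
```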

---

## Profile generation script

`scripts/generate_store_profile.py` - automatic profile generator

```bash
# Show help
python scripts/generate_store_profile.py --help

# Features:
# - Analyzes PDFs via the OCR API
# - Detects: TVA format, date format, payment patterns, B2B
# - Generates Python code with OCR error variants
# - Supports glob patterns (*.pdf)
# - Checks syntax after generation
```
388
backend/modules/data_entry/services/ocr/profiles/__init__.py
Normal file
@@ -0,0 +1,388 @@
"""
Store Profiles Registry with Hot-Reload Support.

This module provides a registry for store-specific OCR extraction profiles.
Profiles can be reloaded at runtime without restarting the server.

Usage:
    from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry

    # Get profile for a CUI
    profile = ProfileRegistry.get_profile("22891860")
    if profile:
        tva_entries = profile.extract_tva_entries(text)

    # Reload all profiles (after file changes)
    count = ProfileRegistry.reload_all()

Architecture:
    - ProfileRegistry: Singleton registry with class methods
    - BaseStoreProfile: Abstract base class for profiles
    - @ProfileRegistry.register: Decorator for profile classes

Hot-Reload Mechanism:
    1. Admin calls POST /profiles/reload endpoint
    2. Registry clears instance cache
    3. importlib.reload() re-executes each profile module
    4. @register decorator re-registers classes with new code
"""

from __future__ import annotations

import importlib
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Type, TYPE_CHECKING

if TYPE_CHECKING:
    from .base import BaseStoreProfile

logger = logging.getLogger(__name__)

# Directory containing profile modules
PROFILES_DIR = Path(__file__).parent


class ProfileRegistry:
    """
    Registry for store-specific OCR extraction profiles.

    Uses class methods for singleton-like behavior without explicit instantiation.
    Supports hot-reload via importlib.reload() for runtime updates.

    Attributes:
        _profiles: Maps CUI -> profile class (not instance)
        _instances: Maps CUI -> profile instance (lazy, cleared on reload)
        _last_reload: Timestamp of last reload
        _loaded: Whether initial load has been performed
    """

    # Class-level storage (singleton pattern via class methods)
    _profiles: Dict[str, Type["BaseStoreProfile"]] = {}
    _instances: Dict[str, "BaseStoreProfile"] = {}
    _last_reload: Optional[datetime] = None
    _loaded: bool = False

    # -------------------------------------------------------------------------
    # Registration
    # -------------------------------------------------------------------------

    @classmethod
    def register(cls, profile_class: Type["BaseStoreProfile"]) -> Type["BaseStoreProfile"]:
        """
        Decorator to register a store profile class.

        Registers the profile for all CUIs in the class's CUI_LIST.
        Safe for re-registration during hot-reload (overwrites existing).

        Usage:
            @ProfileRegistry.register
            class LidlProfile(BaseStoreProfile):
                CUI_LIST = ["22891860"]
                ...

        Args:
            profile_class: Profile class to register

        Returns:
            The same class (allows use as decorator)

        Note:
            A class with an empty CUI_LIST is not registered; a warning is
            logged and the class is returned unchanged.
        """
        cui_list = getattr(profile_class, 'CUI_LIST', [])
        store_name = getattr(profile_class, 'STORE_NAME', profile_class.__name__)

        if not cui_list:
            logger.warning(f"Profile {profile_class.__name__} has empty CUI_LIST, skipping")
            return profile_class

        # Register for each CUI
        for cui in cui_list:
            # Normalize CUI (remove RO prefix, strip whitespace)
            normalized_cui = cls._normalize_cui(cui)

            if normalized_cui in cls._profiles:
                old_class = cls._profiles[normalized_cui]
                logger.debug(
                    f"Re-registering CUI {normalized_cui}: "
                    f"{old_class.__name__} -> {profile_class.__name__}"
                )
                # Clear cached instance for this CUI
                cls._instances.pop(normalized_cui, None)

            cls._profiles[normalized_cui] = profile_class
            logger.debug(f"Registered profile {profile_class.__name__} for CUI {normalized_cui}")

        logger.info(f"Registered {store_name} for CUIs: {cui_list}")
        return profile_class

    # -------------------------------------------------------------------------
    # Lookup
    # -------------------------------------------------------------------------

    @classmethod
    def get_profile(cls, cui: Optional[str]) -> Optional["BaseStoreProfile"]:
        """
        Get profile instance for a CUI.

        Uses lazy instantiation - creates instance on first access.
        Returns None if no profile is registered for this CUI.

        Args:
            cui: CUI to lookup (with or without RO prefix)

        Returns:
            Profile instance or None
        """
        if not cui:
            return None

        # Ensure profiles are loaded
        if not cls._loaded:
            cls._load_all_profiles()

        normalized_cui = cls._normalize_cui(cui)

        # Check if profile exists
        profile_class = cls._profiles.get(normalized_cui)
        if not profile_class:
            return None

        # Lazy instantiation
        if normalized_cui not in cls._instances:
            try:
                cls._instances[normalized_cui] = profile_class()
                logger.debug(f"Instantiated {profile_class.__name__} for CUI {normalized_cui}")
            except Exception as e:
                logger.error(f"Failed to instantiate {profile_class.__name__}: {e}")
                return None

        return cls._instances[normalized_cui]

    @classmethod
    def has_profile(cls, cui: Optional[str]) -> bool:
        """Check if a profile exists for this CUI."""
        if not cui:
            return False
        if not cls._loaded:
            cls._load_all_profiles()
        return cls._normalize_cui(cui) in cls._profiles

    # -------------------------------------------------------------------------
    # Listing
    # -------------------------------------------------------------------------

    @classmethod
    def list_profiles(cls) -> List[Dict]:
        """
        List all registered profiles.

        Returns:
            List of dicts with cuis, class_name, store_name, name_patterns
        """
        if not cls._loaded:
            cls._load_all_profiles()

        result = []
        seen_classes = set()

        for cui, profile_class in cls._profiles.items():
            # Avoid duplicates for profiles with multiple CUIs
            if profile_class.__name__ in seen_classes:
                continue
            seen_classes.add(profile_class.__name__)

            result.append({
                "cuis": list(getattr(profile_class, 'CUI_LIST', [])),
                "class_name": profile_class.__name__,
                "store_name": getattr(profile_class, 'STORE_NAME', profile_class.__name__),
                "name_patterns": list(getattr(profile_class, 'NAME_PATTERNS', [])),
            })

        return result

    @classmethod
    def get_profile_info(cls, cui: str) -> Optional[Dict]:
        """
        Get detailed info about a profile.

        Args:
            cui: CUI to lookup

        Returns:
            Dict with profile details or None
        """
        profile = cls.get_profile(cui)
        if not profile:
            return None

        return {
            "cui": cui,
            "cuis": list(profile.CUI_LIST),
            "class_name": profile.__class__.__name__,
            "store_name": profile.STORE_NAME,
            "name_patterns": list(profile.NAME_PATTERNS),
            "validation_hints": profile.get_validation_hints(),
        }

    # -------------------------------------------------------------------------
    # Hot-Reload
    # -------------------------------------------------------------------------

    @classmethod
    def reload_all(cls) -> int:
        """
        Hot-reload all profile modules.

        Clears instance cache and reloads all .py files in profiles directory.
        Decorator re-registers classes with updated code.

        Returns:
            Number of modules reloaded
        """
        logger.info("Starting profile hot-reload...")

        # Clear instance cache (will be recreated on next get_profile)
        cls._instances.clear()

        # Get list of profile modules (exclude __init__, base)
        module_names = cls._get_profile_module_names()

        count = 0
        for module_name in module_names:
            full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"

            try:
                if full_name in sys.modules:
                    # Reload existing module
                    importlib.reload(sys.modules[full_name])
                    logger.debug(f"Reloaded module: {module_name}")
                else:
                    # Import new module
                    importlib.import_module(full_name)
                    logger.debug(f"Imported new module: {module_name}")
                count += 1
            except Exception as e:
                logger.error(f"Failed to reload {module_name}: {e}")

        cls._last_reload = datetime.utcnow()
        cls._loaded = True

        logger.info(f"Profile hot-reload complete: {count} modules, {len(cls._profiles)} profiles")
        return count

    @classmethod
    def get_reload_status(cls) -> Dict:
        """Get status of the registry including last reload time."""
        return {
            "loaded": cls._loaded,
            "last_reload": cls._last_reload.isoformat() if cls._last_reload else None,
            "profiles_count": len(cls._profiles),
            "instances_count": len(cls._instances),
            "registered_cuis": list(cls._profiles.keys()),
        }

    # -------------------------------------------------------------------------
    # Internal methods
    # -------------------------------------------------------------------------

    @classmethod
    def _normalize_cui(cls, cui: str) -> str:
        """
        Normalize CUI for consistent lookup.

        - Strips surrounding whitespace
        - Converts to uppercase
        - Removes RO prefix (with or without space)

        Args:
            cui: Raw CUI string

        Returns:
            Normalized CUI without the RO prefix (other characters are
            left as-is, so the result is not guaranteed to be digits only)
        """
        if not cui:
            return ""

        cui = str(cui).strip().upper()

        # Remove RO prefix (handles "RO12345" and "RO 12345")
        if cui.startswith("RO"):
            cui = cui[2:].lstrip()

        return cui.strip()

    @classmethod
    def _get_profile_module_names(cls) -> List[str]:
        """
        Get list of profile module names from profiles directory.

        Excludes __init__.py and base.py.

        Returns:
            List of module names (without .py extension)
        """
        excluded = {"__init__", "base", "__pycache__"}
        modules = []

        for path in PROFILES_DIR.glob("*.py"):
            name = path.stem
            if name not in excluded:
                modules.append(name)

        return sorted(modules)

    @classmethod
    def _load_all_profiles(cls) -> None:
        """
        Initial load of all profile modules.

        Called automatically on first get_profile() if not already loaded.
        """
        if cls._loaded:
            return

        logger.info("Loading store profiles...")

        module_names = cls._get_profile_module_names()

        for module_name in module_names:
            full_name = f"backend.modules.data_entry.services.ocr.profiles.{module_name}"
            try:
                importlib.import_module(full_name)
                logger.debug(f"Loaded module: {module_name}")
            except Exception as e:
                logger.error(f"Failed to load {module_name}: {e}")

        cls._loaded = True
        cls._last_reload = datetime.utcnow()

        logger.info(f"Loaded {len(cls._profiles)} store profiles")

    @classmethod
    def clear(cls) -> None:
        """
        Clear all registered profiles.

        Mainly useful for testing.
        """
        cls._profiles.clear()
        cls._instances.clear()
        cls._loaded = False
        cls._last_reload = None


# -------------------------------------------------------------------------
# Module exports
# -------------------------------------------------------------------------

__all__ = [
    "ProfileRegistry",
    "BaseStoreProfile",
]

# Re-export BaseStoreProfile for convenience
from .base import BaseStoreProfile
515
backend/modules/data_entry/services/ocr/profiles/base.py
Normal file
@@ -0,0 +1,515 @@
"""
Base class for store-specific OCR extraction profiles.

Each store can have different receipt formats (TVA layout, total position, etc.).
Store profiles allow customizing extraction logic per-store for better accuracy.

Usage:
    from .base import BaseStoreProfile
    from . import ProfileRegistry

    @ProfileRegistry.register
    class LidlProfile(BaseStoreProfile):
        CUI_LIST = ["22891860"]
        NAME_PATTERNS = ["LIDL", "LDL"]

        def extract_tva_entries(self, text: str) -> List[dict]:
            # Custom Lidl TVA extraction logic
            ...
"""

import re
from abc import ABC
from decimal import Decimal, InvalidOperation
from typing import List, Optional, Tuple, Dict, Any
from datetime import date


class BaseStoreProfile(ABC):
    """
    Abstract base class for store-specific extraction profiles.

    Each profile defines:
    - CUI_LIST: CUI codes that identify this store (without RO prefix)
    - NAME_PATTERNS: OCR-tolerant name patterns for fallback matching
    - Custom extraction methods for TVA, total, date, etc.

    The ProfileRegistry uses CUI_LIST to lookup profiles during extraction.
    """

    # -------------------------------------------------------------------------
    # Class attributes - override in subclasses
    # -------------------------------------------------------------------------

    # List of CUI codes (without RO prefix) that identify this store
    CUI_LIST: List[str] = []

    # OCR-tolerant name patterns for fallback matching
    NAME_PATTERNS: List[str] = []

    # Store display name
    STORE_NAME: str = "Unknown Store"

    # -------------------------------------------------------------------------
    # Generic patterns - can be overridden in subclasses
    # -------------------------------------------------------------------------

    # Total amount patterns (confidence-weighted)
    TOTAL_PATTERNS = [
        (r'T[O0]TAL[.\s]+L[E3][I1!]\s*:?\s*([\d\s.,]+)', 0.98),
        (r'TOTAL\s+LEI\s*([\d\s.,]+)', 0.98),
        (r'[OT]?OTAL\s+LEI\s*([\d\s.,]+)', 0.95),
        (r'TOTAL\s*:?\s*([\d\s.,]+)\s*(?:RON|LEI)?', 0.95),
        (r'TOTAL\s+(?:RON|LEI)\s*([\d\s.,]+)', 0.95),
        (r'SUBTOTAL\s*([\d\s.,]+)', 0.90),
        (r'DE\s+PLATA\s*:?\s*([\d\s.,]+)', 0.90),
        (r'SUMA\s*:?\s*([\d\s.,]+)', 0.85),
    ]

    # Date patterns (confidence-weighted)
    DATE_PATTERNS = [
        (r'D[AR]TA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
        (r'DATA\s*:?\s*(\d{2}[-./]\d{2}[-./]\d{4})', 0.98),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+[O0]RA\s*:?\s*\d{2}:\d{2}', 0.95),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})\s+\d{2}:\d{2}', 0.90),
        (r'(\d{2}[-./]\d{2}[-./]\d{4})', 0.80),
        (r'(\d{4}[-./]\d{2}[-./]\d{2})', 0.75),
    ]

    # Date patterns with OCR-introduced spaces (separate because format is different)
    DATE_PATTERNS_OCR_SPACES = [
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.92, 'ymd'),
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.85, 'ymd'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
    ]

    # Receipt number patterns (confidence-weighted)
    NUMBER_PATTERNS = [
        (r'NDS\s*:?\s*(\d+)', 0.98),
        (r'C3POS[-A-Z0-9]*[N:](\d{6,7})', 0.98),
        (r'C3POS.*?(\d{6,7})\b', 0.95),
        (r'BF\s*:\s*(\d{4,})', 0.96),
        (r'BF\s+(\d{4,})', 0.93),
        (r'NIVS\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s*BON\s*:?\s*(\d+)', 0.95),
        (r'BON\s+(?:FISCAL\s+)?NR\.?\s*:?\s*(\d+)', 0.95),
        (r'CHITANTA\s+NR\.?\s*:?\s*(\d+)', 0.95),
        (r'NR\.?\s+DOCUMENT\s*:?\s*(\d+)', 0.90),
        (r'ID\s*BF\s*:?\s*(\d+)', 0.90),
    ]

    # Payment method patterns (pattern, method_type, confidence)
    PAYMENT_PATTERNS = [
        (r'CARTE\s+CREDIT\s*:?\s*([\d\s.,]+)', 'CARD', 0.98),
        (r'CARTE\s+CREDIT\s*:?\s*\n\s*([\d\s.,]+)', 'CARD', 0.97),
        (r'(?:PLATA\s+)?CARD\s*[:\sA-Z]?\s*([\d\s.,]+)', 'CARD', 0.95),
        (r'NUMERAR\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.95),
        (r'CASH\s*:?\s*([\d\s.,]+)', 'NUMERAR', 0.90),
        (r'(?:^|\n|\s)RD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.70),
        (r'(?:^|\n|\s)ARD\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'CARD', 0.75),
        (r'(?:^|\n|\s)MERAR\s*:?\s*(\d{1,6}[.,]\d{2})\b', 'NUMERAR', 0.70),
    ]

    # Client section markers (for B2B receipts)
    CLIENT_MARKERS = [
        r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:',
        r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:',
        r'CLIENT\s+C\.?\s*[UI1]\.?\s*[IF1]\.?\s*:',
        r'CLIENT\s*:',
        r'CUMPARATOR\s*:',
        r'BENEFICIAR\s*:',
    ]

    # Client CUI patterns (pattern, confidence)
    CLIENT_CUI_PATTERNS = [
        (r'(R[O0]\d{6,10})\s*\n\s*CLIENT\s+C\.?\s*U\.?\s*[I1]\.?', 0.99),
        (r'(R[O0]\d{6,10})\s*:?\s*\n\s*CLIENT', 0.98),
        (r'C[I1]F\s+[A-Z]*\s*CLIENT\s*:?\s*(R[O0]\d{6,10})', 0.98),
        (r'C\.?\s*[I1]\.?\s*F\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
        (r'C\.?\s*U\.?\s*[I1]\.?\s+CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.98),
        (r'CLIENT\s+C\.?\s*U\.?\s*[I1]\.?\s*:?\s*(R[O0]?\d{6,10})', 0.95),
        (r'CLIENT\s*:?\s*(R[O0]?\d{6,10})', 0.90),
    ]

    # Company type indicators (for identifying company names)
    COMPANY_INDICATORS = [
        r'\bS\.?\s*R\.?\s*L\.?\b',      # S.R.L. or S. R. L.
        r'\bS\.?\s*A\.?\b',             # S.A. or S. A.
        r'\bS\.?\s*N\.?\s*C\.?\b',      # S.N.C. or S. N. C.
        r'\bS\.?\s*C\.?\s*S\.?\b',      # S.C.S. or S. C. S.
        r'\bI\.?\s*I\.?\b',             # I.I. or I. I.
        r'\bP\.?\s*F\.?\s*A\.?\b',      # P.F.A. or P. F. A.
        r'\bS\.?\s*C\.?\s+[A-Z]',       # S.C. followed by company name
        r'HOLDING',
        r'COMPANY',
        r'GROUP',
    ]

    # Maximum reasonable payment amount (to filter OCR errors)
    MAX_PAYMENT = Decimal('100000')

    # -------------------------------------------------------------------------
    # Extraction methods - override in subclasses as needed
    # -------------------------------------------------------------------------

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Override this method in subclasses to handle store-specific TVA formats.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of dicts with keys: code, percent, amount
        """
        return []

    def extract_total(self, text: str) -> Tuple[Optional[Decimal], float]:
        """
        Extract total amount from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (amount, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        for pattern, confidence in self.TOTAL_PATTERNS:
            match = re.search(pattern, text_upper)
            if match:
                amount = self._parse_decimal(match.group(1))
                if amount and amount > 0 and amount < self.MAX_PAYMENT:
                    return (amount, confidence)

        return (None, 0.0)

    def extract_date(self, text: str) -> Tuple[Optional[date], float]:
        """
        Extract receipt date from text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (date, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        # Try standard patterns first
        for pattern, confidence in self.DATE_PATTERNS:
            match = re.search(pattern, text_upper)
            if match:
                parsed = self._parse_date(match.group(1))
                if parsed:
                    return (parsed, confidence)

        # Try OCR-corrupted patterns with spaces
        for pattern, confidence, fmt in self.DATE_PATTERNS_OCR_SPACES:
            match = re.search(pattern, text_upper)
            if match:
                try:
                    if fmt == 'ymd':
                        year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
                    else:  # dmy
                        day, month, year = int(match.group(1)), int(match.group(2)), int(match.group(3))

                    if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
                        return (date(year, month, day), confidence)
                except (ValueError, TypeError):
                    continue

        return (None, 0.0)

    def extract_receipt_number(self, text: str) -> Tuple[Optional[str], float]:
        """
        Extract receipt number from text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (number, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        for pattern, confidence in self.NUMBER_PATTERNS:
            match = re.search(pattern, text_upper)
            if match:
                number = match.group(1).strip()
                if number and len(number) >= 3:
                    return (number, confidence)

        return (None, 0.0)

    def extract_payment_methods(self, text: str) -> List[dict]:
        """
        Extract payment methods (CARD/NUMERAR) from receipt.

        Supports multiple payments of the same type (e.g., 2x CARD for split payments).
        Each payment is returned as a separate entry with its amount.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of dicts: [{'method': 'CARD'/'NUMERAR', 'amount': Decimal, 'confidence': float}]
            Multiple entries of same method type are allowed for split payments.
        """
        text_upper = text.upper()
        methods = []
        # Track (method, amount) pairs to avoid exact duplicates from overlapping patterns
        seen_entries = set()

        for pattern, method, confidence in self.PAYMENT_PATTERNS:
            for match in re.finditer(pattern, text_upper):
                try:
                    amount = self._parse_decimal(match.group(1))
                    if amount and amount > 0 and amount < self.MAX_PAYMENT:
                        # Deduplicate by (method, amount) to avoid same entry from multiple patterns
                        # But allow different amounts for same method (split payments)
                        entry_key = (method, amount)
                        if entry_key not in seen_entries:
                            methods.append({
                                'method': method,
                                'amount': amount,
                                'confidence': confidence
                            })
                            seen_entries.add(entry_key)
                except (ValueError, InvalidOperation):
                    continue

        return methods

    def extract_client_cui(self, text: str) -> Tuple[Optional[str], float]:
        """
        Extract client CUI from B2B receipts.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (cui, confidence) or (None, 0.0)
        """
        text_upper = text.upper()

        # First check if there's a CLIENT section
        has_client_section = any(
            re.search(marker, text_upper, re.IGNORECASE)
            for marker in self.CLIENT_MARKERS
        )

        if not has_client_section:
            return (None, 0.0)

        # Try to extract CUI
        for pattern, confidence in self.CLIENT_CUI_PATTERNS:
            match = re.search(pattern, text_upper, re.IGNORECASE | re.MULTILINE)
            if match:
                cui = match.group(1)
                # Normalize: remove RO prefix for storage
                cui_digits = re.sub(r'[^0-9]', '', cui)
                if 6 <= len(cui_digits) <= 10:
                    return (cui_digits, confidence)

        return (None, 0.0)

    def extract_client_name(self, text: str) -> Tuple[Optional[str], float]:
        """
        Extract client/buyer company name from B2B receipts.

        Args:
            text: Raw OCR text from receipt

        Returns:
            Tuple of (client_name, confidence) or (None, 0.0)
        """
        text_upper = text.upper()
        lines = text.split('\n')

        # First check if there's a CLIENT section
        client_section_idx = None
        for i, line in enumerate(lines):
            line_upper = line.upper().strip()
            if any(re.search(marker, line_upper, re.IGNORECASE) for marker in self.CLIENT_MARKERS):
                client_section_idx = i
                break

        if client_section_idx is None:
            return (None, 0.0)

        # Look for company name in CLIENT section
        line = lines[client_section_idx].strip()
        line_upper = line.upper()

        # Strategy 1: Check if name is on same line after ":"
        if ':' in line:
            name_part = line.split(':', 1)[1].strip()
            if name_part and len(name_part) >= 3:
                # Skip if it looks like a CUI (RO followed by digits)
                if re.match(r'^R[O0]?\d{6,10}$', name_part.upper()):
                    pass  # This is CUI, not name - continue to next strategy
                else:
                    # Check for company indicators
                    name_upper = name_part.upper()
                    if any(re.search(ind, name_upper) for ind in self.COMPANY_INDICATORS):
                        return (self._clean_company_name(name_part), 0.95)
                    elif len(name_part) >= 5 and not name_part.isdigit():
                        return (self._clean_company_name(name_part), 0.80)

        # Strategy 2: Check next line for company name
        if client_section_idx + 1 < len(lines):
            next_line = lines[client_section_idx + 1].strip()
            next_upper = next_line.upper()

            # Skip if it's a CUI/CIF line or looks like CUI
            if not re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', next_upper):
                if not re.match(r'^R[O0]?\d{6,10}$', next_upper):
                    if any(re.search(ind, next_upper) for ind in self.COMPANY_INDICATORS):
                        return (self._clean_company_name(next_line), 0.90)
                    elif len(next_line) >= 5 and not next_line.isdigit():
                        # Check it's not CUI/CIF/COD keywords
                        if not any(kw in next_upper for kw in ['CUI', 'CIF', 'COD', 'FISCAL']):
                            return (self._clean_company_name(next_line), 0.75)

        # Strategy 3: Look for any line with company indicators in CLIENT section region
        search_end = min(client_section_idx + 5, len(lines))
        for i in range(client_section_idx + 1, search_end):
            line = lines[i].strip()
            line_upper = line.upper()

            # Skip CUI/CIF lines
            if re.search(r'C\.?\s*[UI]\.?\s*[IF]\.?', line_upper):
                continue
            if re.match(r'^R[O0]?\d{6,10}$', line_upper):
                continue

            if any(re.search(ind, line_upper) for ind in self.COMPANY_INDICATORS):
                return (self._clean_company_name(line), 0.85)

        return (None, 0.0)

    @staticmethod
    def _clean_company_name(name: str) -> str:
        """Clean company name for storage."""
        if not name:
            return ""
        # Remove extra whitespace
        name = re.sub(r'\s+', ' ', name).strip()
        # Remove trailing punctuation except periods in S.R.L., S.A., etc.
        name = re.sub(r'[,;:]+$', '', name).strip()
        return name
|
||||||
|
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
# Validation hints - override to customize validation behavior
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
|
||||||
|
def get_validation_hints(self) -> Dict[str, Any]:
|
||||||
|
"""
|
||||||
|
Return validation hints for this store.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with validation hints. Common keys:
|
||||||
|
- has_multi_rate_tva: bool - Store uses multiple TVA rates
|
||||||
|
- card_equals_total: bool - CARD payment equals total
|
||||||
|
- has_client_cui: bool - Receipt includes client CUI
|
||||||
|
- has_efactura: bool - Store uses e-factura format
|
||||||
|
- is_non_vat_payer: bool - Store is not a VAT payer
|
||||||
|
"""
|
||||||
|
return {}
|
||||||
|
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
# Helper methods - available to all subclasses
|
||||||
|
# -------------------------------------------------------------------------
|
||||||
|
|
||||||
|
    @staticmethod
    def _normalize_number(text: str) -> str:
        """
        Normalize a number string for Decimal conversion.

        Handles Romanian formats: "1.234,56" -> "1234.56"
        """
        if not text:
            return "0"

        # Remove spaces
        text = text.replace(" ", "")

        # Determine decimal separator
        last_comma = text.rfind(",")
        last_dot = text.rfind(".")

        if last_comma > last_dot:
            text = text.replace(".", "").replace(",", ".")
        elif last_dot > last_comma:
            text = text.replace(",", "")
        else:
            text = text.replace(",", ".")

        return text

    @staticmethod
    def _parse_decimal(text: str) -> Optional[Decimal]:
        """Parse a string to Decimal, handling various formats."""
        try:
            normalized = BaseStoreProfile._normalize_number(text)
            return Decimal(normalized)
        except (InvalidOperation, ValueError, TypeError):
            return None

    @staticmethod
    def _parse_date(text: str) -> Optional[date]:
        """
        Parse date string in various formats.

        Supports: DD-MM-YYYY, DD/MM/YYYY, DD.MM.YYYY, YYYY-MM-DD
        """
        if not text:
            return None

        # Normalize separators
        text = text.replace('/', '-').replace('.', '-')

        try:
            parts = text.split('-')
            if len(parts) != 3:
                return None

            # Determine format based on first part length
            if len(parts[0]) == 4:
                # YYYY-MM-DD
                year, month, day = int(parts[0]), int(parts[1]), int(parts[2])
            else:
                # DD-MM-YYYY
                day, month, year = int(parts[0]), int(parts[1]), int(parts[2])

            # Validate ranges
            if 1 <= day <= 31 and 1 <= month <= 12 and 2000 <= year <= 2100:
                return date(year, month, day)
        except (ValueError, TypeError, IndexError):
            pass

        return None

    @staticmethod
    def _clean_text(text: str) -> str:
        """Clean OCR text for pattern matching."""
        if not text:
            return ""
        text = re.sub(r'\s+', ' ', text)
        text = re.sub(r'[\x00-\x09\x0b\x0c\x0e-\x1f\x7f]', '', text)
        return text.strip()

    # -------------------------------------------------------------------------
    # Magic methods
    # -------------------------------------------------------------------------

    def __repr__(self) -> str:
        return f"<{self.__class__.__name__} CUI={self.CUI_LIST}>"

    def __str__(self) -> str:
        return f"{self.STORE_NAME} ({', '.join(self.CUI_LIST)})"
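The number helpers above can be tried in isolation. What follows is a minimal sketch, not the project's code: `normalize_number` and `parse_decimal` are hypothetical free-function copies of the logic shown in `_normalize_number` and `_parse_decimal`, written only to illustrate how the "last separator wins" rule behaves on typical receipt amounts.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_number(text: str) -> str:
    """Standalone copy of the separator logic, for illustration only."""
    if not text:
        return "0"
    text = text.replace(" ", "")
    last_comma = text.rfind(",")
    last_dot = text.rfind(".")
    if last_comma > last_dot:
        # Comma is the decimal separator: drop dots, turn comma into dot.
        text = text.replace(".", "").replace(",", ".")
    elif last_dot > last_comma:
        # Dot is the decimal separator: drop thousands commas.
        text = text.replace(",", "")
    return text


def parse_decimal(text: str) -> Optional[Decimal]:
    """Return a Decimal, or None when the text is not a number."""
    try:
        return Decimal(normalize_number(text))
    except (InvalidOperation, ValueError, TypeError):
        return None


if __name__ == "__main__":
    for raw in ["1.234,56", "1,234.56", "7,71", "1.234", "abc"]:
        print(repr(raw), "->", parse_decimal(raw))
```

One behaviour worth keeping in mind: a value with a dot and no comma, such as "1.234", is read as the decimal 1.234 and not as one thousand two hundred thirty-four, because the dot is the last separator seen.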
@@ -0,0 +1,54 @@
"""
BEST PRINT TRADE ACTIV SRL store profile for OCR extraction.

Stamp manufacturing service. Non-VAT payer (neplătitor de TVA).
"""

from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class BestPrintProfile(BaseStoreProfile):
    """
    BEST PRINT TRADE ACTIV SRL - non-VAT payer profile.

    Key characteristics:
    - Non-VAT payer (neplătitor de TVA) - NO TVA on receipts
    - Stamp manufacturing and printing services
    - Total amount has no TVA component
    - CARD payment typical
    """

    CUI_LIST = ["45417955"]
    NAME_PATTERNS = ["BEST PRINT", "BESTPRINT", "BEST PRINT TRADE", "BEST PR1NT"]
    STORE_NAME = "BEST PRINT TRADE ACTIV SRL"

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries - returns empty for non-VAT payer.

        BEST PRINT is a non-VAT payer (neplătitor de TVA),
        so no TVA entries are expected on receipts.

        Args:
            text: Raw OCR text from receipt (unused)

        Returns:
            Empty list (non-VAT payer has no TVA)
        """
        # Non-VAT payer - no TVA entries
        return []

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return BEST PRINT-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": True,  # May have client CUI
            "has_efactura": False,
            "is_non_vat_payer": True,  # CRITICAL: Non-VAT payer
            "tva_pattern": "none",
        }
101
backend/modules/data_entry/services/ocr/profiles/brick.py
Normal file
@@ -0,0 +1,101 @@
"""
BRICK (Five-Holding) store profile for OCR extraction.

Five-Holding S.A. operates BRICK stores with standard receipt format.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class BrickProfile(BaseStoreProfile):
    """
    FIVE-HOLDING S.A. (BRICK) - standard TVA format.

    Key characteristics:
    - Standard TVA format
    - Single TVA rate typically
    - No client CUI on receipts
    """

    CUI_LIST = ["10562600"]
    NAME_PATTERNS = ["BRICK", "FIVE-HOLDING", "FIVE HOLDING", "BR1CK"]  # OCR variants
    STORE_NAME = "FIVE-HOLDING S.A."

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # Simple: "TVA XX% YY,YY"
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract BRICK-specific TVA entries.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return BRICK-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
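The "coded patterns first, simple pattern as fallback" strategy used by `extract_tva_entries` in this profile (and repeated in several others in this commit) can be sketched as a standalone function. This is an illustration under assumptions, not the profile itself: `extract_tva` and `to_decimal` are hypothetical names, and `to_decimal` is a simplified stand-in for `_parse_decimal` that only understands comma-decimal amounts. The three regular expressions are copied from `TVA_PATTERNS` above.

```python
import re
from decimal import Decimal, InvalidOperation

# Patterns copied from BrickProfile.TVA_PATTERNS.
CODED_PATTERNS = [
    r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
    r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
]
SIMPLE_PATTERN = r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)'


def to_decimal(raw):
    """Hypothetical helper: treats ',' as the decimal separator."""
    try:
        return Decimal(raw.replace(".", "").replace(",", "."))
    except InvalidOperation:
        return None


def extract_tva(text):
    entries, seen = [], set()
    # Pass 1: patterns that carry an explicit TVA code (A-D).
    for pattern in CODED_PATTERNS:
        for m in re.finditer(pattern, text, re.IGNORECASE):
            code, percent = m.group(1).upper(), int(m.group(2))
            amount = to_decimal(m.group(3))
            # (code, percent) is the deduplication key, so a line matched
            # by both coded patterns is only recorded once.
            if amount and amount > 0 and (code, percent) not in seen:
                seen.add((code, percent))
                entries.append({'code': code, 'percent': percent, 'amount': amount})
    # Pass 2: only when nothing coded was found; default the code to 'A'.
    if not entries:
        for m in re.finditer(SIMPLE_PATTERN, text, re.IGNORECASE):
            amount = to_decimal(m.group(2))
            if amount and amount > 0:
                entries.append({'code': 'A', 'percent': int(m.group(1)), 'amount': amount})
                break
    return entries


if __name__ == "__main__":
    print(extract_tva("TVA A: 19% = 12,34"))
    print(extract_tva("TVA 19% 5,00"))
```

In the first call both coded patterns match the same line, and the `seen` set keeps a single entry. In the second call neither coded pattern matches, so the simple pattern supplies one entry with the default code `A`.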
118
backend/modules/data_entry/services/ocr/profiles/dedeman.py
Normal file
@@ -0,0 +1,118 @@
"""
DEDEMAN store profile for OCR extraction.

Dedeman receipts may include e-factura information and use standard TVA format.
Large DIY retailer in Romania.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class DedemanProfile(BaseStoreProfile):
    """
    DEDEMAN SRL - standard TVA with e-factura support.

    Key characteristics:
    - Standard TVA format
    - May include e-factura reference number
    - Professional receipts for construction materials
    """

    CUI_LIST = ["2816464"]
    NAME_PATTERNS = ["DEDEMAN", "DEDEMAN SRL", "OEDEMAN", "D3DEMAN"]  # OCR variants
    STORE_NAME = "DEDEMAN SRL"

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA (XX%) YY,YY"
        r'TVA\s*\(?\s*(\d{1,2})\s*%\s*\)?\s*:?\s*([\d.,]+)',
    ]

    # E-factura pattern for reference extraction
    EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract Dedeman-specific TVA entries.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def extract_efactura_reference(self, text: str) -> str | None:
        """
        Extract e-factura reference number if present.

        Args:
            text: Raw OCR text from receipt

        Returns:
            E-factura reference string or None
        """
        match = re.search(self.EFACTURA_PATTERN, text, re.IGNORECASE)
        return match.group(1) if match else None

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return Dedeman-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": False,
            "has_efactura": True,
            "is_non_vat_payer": False,
        }
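A short sketch of how `EFACTURA_PATTERN` behaves may help when tuning it. The pattern below is copied from the profile above; `efactura_ref` and `STRICT_PATTERN` are hypothetical names introduced only for this illustration, and the sample strings are invented, not taken from real Dedeman receipts. Because the search runs with `re.IGNORECASE`, the class `[A-Z0-9]` also accepts lowercase letters, so an ordinary word after "e-factura" is captured as if it were a reference. The stricter variant, which requires at least one digit, is one possible way to avoid that, not something this commit implements.

```python
import re

# Copied from DedemanProfile.EFACTURA_PATTERN.
EFACTURA_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]+)'

# Hypothetical stricter variant: the reference must contain a digit.
STRICT_PATTERN = r'e-?factura\s*:?\s*([A-Z0-9]*\d[A-Z0-9]*)'


def efactura_ref(text, pattern=EFACTURA_PATTERN):
    """Return the captured reference, or None when nothing matches."""
    match = re.search(pattern, text, re.IGNORECASE)
    return match.group(1) if match else None


if __name__ == "__main__":
    samples = ["E-FACTURA: 4123456789", "e-factura emisa", "efactura AB12"]
    for sample in samples:
        print(sample, "->", efactura_ref(sample), "/", efactura_ref(sample, STRICT_PATTERN))
```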
@@ -0,0 +1,102 @@
"""
ELECTROBERING S.R.L. store profile for OCR extraction.

Electronics and home supplies store.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class ElectroberingProfile(BaseStoreProfile):
    """
    ELECTROBERING S.R.L. - standard TVA profile.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Electronics and home supplies
    - May have client CUI for B2B purchases
    - CARD payment typical
    """

    CUI_LIST = ["2744937"]
    NAME_PATTERNS = ["ELECTROBERING", "ELECTR0BERING", "ELECTROBERING SRL"]
    STORE_NAME = "ELECTROBERING S.R.L."

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return ELECTROBERING-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": True,  # May have client CUI for B2B
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
103
backend/modules/data_entry/services/ocr/profiles/gama_ink.py
Normal file
@@ -0,0 +1,103 @@
"""
GAMA INK SERVICE SRL store profile for OCR extraction.

Toner refill and printer supplies store.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class GamaInkProfile(BaseStoreProfile):
    """
    GAMA INK SERVICE SRL - standard TVA profile.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Service-based (toner refill, printer supplies)
    - CARD payment typical
    """

    CUI_LIST = ["17741882"]
    NAME_PATTERNS = ["GAMA INK", "GAMA", "GAMAINK", "GAMA INK SERVICE"]
    STORE_NAME = "GAMA INK SERVICE SRL"

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
        # "TVA: YY,YY" (amount only, percent inferred)
        r'TVA\s*:?\s*([\d.,]+)\s*(?:LEI|RON)?',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first (have both code and percent)
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format (percent + amount without code)
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
                            'amount': amount
                        })
                        break
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return GAMA INK-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
@@ -0,0 +1,53 @@
"""
KINETERRA store profile for OCR extraction.

Kineterra is a non-VAT payer (neplătitor de TVA).
Receipts don't include TVA breakdown.
"""

from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class KineterraProfile(BaseStoreProfile):
    """
    KINETERRA CONCEPT SRL - non-VAT payer profile.

    Key characteristics:
    - Non-VAT payer (neplătitor de TVA)
    - No TVA breakdown on receipts
    - Total amount has no TVA component
    """

    CUI_LIST = ["31180432"]
    NAME_PATTERNS = ["KINETERRA", "KINETERRA CONCEPT", "K1NETERRA"]  # OCR variants
    STORE_NAME = "KINETERRA CONCEPT SRL"

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries - returns empty for non-VAT payer.

        Kineterra is a non-VAT payer, so no TVA entries are expected.

        Args:
            text: Raw OCR text from receipt (unused)

        Returns:
            Empty list (non-VAT payer has no TVA)
        """
        # Non-VAT payer - no TVA entries
        return []

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return Kineterra-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": True,
            "tva_pattern": "none",
        }
93
backend/modules/data_entry/services/ocr/profiles/lidl.py
Normal file
@@ -0,0 +1,93 @@
"""
LIDL store profile for OCR extraction.

Lidl receipts have a specific TVA format without hyphen/colon separators:
    TOTAL TVA 9,84
    TVA A 21,00% 7,71
    TVA B 11,00% 2,13

This profile handles multi-rate TVA extraction for Lidl receipts.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class LidlProfile(BaseStoreProfile):
    """
    LIDL DISCOUNT S.R.L. - multi-rate TVA profile.

    Key characteristics:
    - Multi-rate TVA (codes A, B, C, D with any percentage - patterns are flexible)
    - TVA format: "TVA A XX,XX% YY,YY" (code + percent + amount on same line)
    - Supports historical rates (19%, 9%, 5%) and current rates (21%, 11%)
    - CARD payment usually equals total
    - No client CUI on receipts
    """

    CUI_LIST = ["22891860"]
    NAME_PATTERNS = ["LIDL", "LDL", "L1DL", "LIDL DISCOUNT"]  # OCR variants
    STORE_NAME = "LIDL DISCOUNT S.R.L."

    # Lidl-specific TVA patterns
    # Format: "TVA A 21,00% 7,71" (code + percent + amount on same line)
    TVA_PATTERNS = [
        # Primary: "TVA A 21,00% 7.71" with various spacing
        r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
        # With backslash OCR artifact: "TVA A \21,00% 7.71"
        r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
        # IVA variant (rare OCR misread)
        r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract Lidl-specific TVA entries.

        Handles multiple TVA rates (A, B, C, D) commonly found on Lidl receipts.
        Uses deduplication to avoid counting the same entry twice from different patterns.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()  # Deduplication key: (code, percent)

        for pattern in self.TVA_PATTERNS:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation):
                    continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return Lidl-specific validation hints."""
        return {
            "has_multi_rate_tva": True,
            "card_equals_total": True,
            "has_client_cui": False,
            "has_efactura": False,
            "is_non_vat_payer": False,
        }
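The multi-rate format described in the module docstring can be checked against its own sample lines. The sketch below is illustrative only: it applies the primary pattern from `TVA_PATTERNS` to the three-line sample from the docstring, and then compares the sum of the per-rate amounts with the "TOTAL TVA" line. `TOTAL_PATTERN` and the cross-check are assumptions added for the example; nothing in this commit shows such a check being performed.

```python
import re
from decimal import Decimal

# Primary pattern copied from LidlProfile.TVA_PATTERNS.
LIDL_TVA = r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'

# Hypothetical pattern for the summary line; not part of the profile.
TOTAL_PATTERN = r'TOTAL\s+TVA\s+([\d.,]+)'

# Sample lines taken from the module docstring above.
receipt = """TOTAL TVA 9,84
TVA A 21,00% 7,71
TVA B 11,00% 2,13"""


def amount(raw):
    """Comma-decimal amounts only, enough for this sample."""
    return Decimal(raw.replace(",", "."))


entries = [
    (m.group(1).upper(), int(m.group(2)), amount(m.group(3)))
    for m in re.finditer(LIDL_TVA, receipt, re.IGNORECASE)
]
per_rate_sum = sum(value for _, _, value in entries)
declared_total = amount(re.search(TOTAL_PATTERN, receipt).group(1))

if __name__ == "__main__":
    print(entries)
    print("sum matches TOTAL TVA:", per_rate_sum == declared_total)
```

The "TOTAL TVA 9,84" line is not picked up as an entry because the pattern requires a code letter between "TVA" and the percentage.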
99
backend/modules/data_entry/services/ocr/profiles/omv.py
Normal file
@@ -0,0 +1,99 @@
"""
OMV Petrom store profile for OCR extraction.

OMV receipts typically include client CUI and use standard TVA format.
Common at gas stations with fuel purchases.

Date format: YYYY. MM. DD with spaces (e.g., "2025. 08. 14")
"""

import re
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any, Tuple, Optional

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class OMVProfile(BaseStoreProfile):
    """
    OMV PETROM MARKETING S.R.L. - standard TVA with client CUI.

    Key characteristics:
    - Standard TVA format (usually single rate, any percentage)
    - Includes client CUI on receipt (for business purchases)
    - TVA table format: "A-XX,XX% base_amount tva_amount"
    - Supports historical rates (19%) and current rates (21%)
    - Date format: YYYY. MM. DD (with spaces)
    """

    CUI_LIST = ["11201891"]
    NAME_PATTERNS = ["OMV", "PETROM", "OMV PETROM", "0MV"]  # OCR variants
    STORE_NAME = "OMV PETROM MARKETING S.R.L."

    # OMV TVA table pattern: "A-19,00% 285,66 49,58" (code-percent base tva)
    TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'

    # Standard TVA pattern fallback
    TVA_STANDARD_PATTERN = r'TVA\s*:?\s*([\d.,]+)'

    # OMV specific: prioritize YYYY. MM. DD format with spaces
    DATE_PATTERNS_OCR_SPACES = [
        # YYYY. MM. DD with time (OMV format)
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
        (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
        # Fallback to DD. MM. YYYY
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
        (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract OMV-specific TVA entries.

        OMV receipts often show TVA in table format with base and TVA amounts.
        Only the table format is parsed here: if it is not found, an empty list
        is returned (TVA_STANDARD_PATTERN is defined above but not applied in
        this method).

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try table format first (more accurate)
        for match in re.finditer(self.TVA_TABLE_PATTERN, text, re.IGNORECASE):
            try:
                code = match.group(1).upper()
                percent = int(match.group(2))
                # TVA amount is the second number (smaller one)
                tva_amount = self._parse_decimal(match.group(4))

                if tva_amount and tva_amount > 0:
                    entry_key = (code, percent)
                    if entry_key not in seen:
                        entries.append({
                            'code': code,
                            'percent': percent,
                            'amount': tva_amount
                        })
                        seen.add(entry_key)
            except (ValueError, InvalidOperation):
                continue

        return entries

    def get_validation_hints(self) -> Dict[str, Any]:
        """Return OMV-specific validation hints."""
        return {
            "has_multi_rate_tva": False,
            "card_equals_total": False,
            "has_client_cui": True,
            "has_efactura": False,
            "is_non_vat_payer": False,
            "tva_table_format": True,
        }
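The OMV date patterns are `(regex, confidence, order)` tuples, but the code that consumes `DATE_PATTERNS_OCR_SPACES` is not part of this file. The sketch below therefore rests on an assumption: that the patterns are tried in list order and that the third element says how to map the three captured groups to year, month and day. `find_date` is a hypothetical helper, not the base-class method, and the sample strings other than the table line from the comment above are invented.

```python
import re
from datetime import date

# Copied from OMVProfile.
TVA_TABLE_PATTERN = r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)'
DATE_PATTERNS = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def find_date(text):
    """Assumed consumer: first pattern that yields a valid date wins."""
    for pattern, confidence, order in DATE_PATTERNS:
        match = re.search(pattern, text)
        if not match:
            continue
        first, second, third = (int(group) for group in match.groups())
        if order == 'ymd':
            year, month, day = first, second, third
        else:
            year, month, day = third, second, first
        try:
            return date(year, month, day), confidence
        except ValueError:
            # Impossible calendar date (e.g. month 13): try the next pattern.
            continue
    return None, 0.0


if __name__ == "__main__":
    print(find_date("2025. 08. 14 10:32"))
    print(find_date("14.08.2025"))
    row = re.search(TVA_TABLE_PATTERN, "A-19,00% 285,66 49,58")
    print(row.group(1), row.group(2), row.group(4))
```

For the table row, group 3 is the first amount on the line and group 4 the second, which is the one `extract_tva_entries` stores as the TVA amount.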
101
backend/modules/data_entry/services/ocr/profiles/pictus_velum.py
Normal file
@@ -0,0 +1,101 @@
"""
PICTUS VELUM SRL store profile for OCR extraction.

Office supplies and stationery store.
"""

import re
from decimal import Decimal, InvalidOperation
from typing import List, Dict, Any

from .base import BaseStoreProfile
from . import ProfileRegistry


@ProfileRegistry.register
class PictusVelumProfile(BaseStoreProfile):
    """
    PICTUS VELUM SRL - standard TVA profile.

    Key characteristics:
    - Standard TVA format (single rate, any percentage)
    - Office supplies and stationery (rechizite)
    - CARD payment typical
    """

    CUI_LIST = ["39634534"]
    NAME_PATTERNS = ["PICTUS", "PICTUS VELUM", "P1CTUS", "PICTUS VELUM SRL"]
    STORE_NAME = "PICTUS VELUM SRL"

    # Standard TVA patterns (flexible - accepts any rate)
    TVA_PATTERNS = [
        # "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
        r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
        # "A - XX,XX% = YY,YY"
        r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
        # "TVA XX% YY,YY" (simple format without code)
        r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
    ]

    def extract_tva_entries(self, text: str) -> List[dict]:
        """
        Extract TVA entries from receipt text.

        Args:
            text: Raw OCR text from receipt

        Returns:
            List of TVA entries with code, percent, and amount
        """
        entries = []
        seen = set()

        # Try coded patterns first
        for pattern in self.TVA_PATTERNS[:2]:
            for match in re.finditer(pattern, text, re.IGNORECASE):
                try:
                    code = match.group(1).upper()
                    percent = int(match.group(2))
                    amount = self._parse_decimal(match.group(3))

                    if amount and amount > 0:
                        entry_key = (code, percent)
                        if entry_key not in seen:
                            entries.append({
                                'code': code,
                                'percent': percent,
                                'amount': amount
                            })
                            seen.add(entry_key)
                except (ValueError, InvalidOperation, IndexError):
                    continue

        # Fallback to simple format
        if not entries:
            simple_pattern = self.TVA_PATTERNS[2]
            for match in re.finditer(simple_pattern, text, re.IGNORECASE):
                try:
                    percent = int(match.group(1))
                    amount = self._parse_decimal(match.group(2))

                    if amount and amount > 0:
                        entries.append({
                            'code': 'A',
                            'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
break
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
|
||||||
|
return entries
|
||||||
|
|
||||||
|
def get_validation_hints(self) -> Dict[str, Any]:
|
||||||
|
"""Return PICTUS VELUM-specific validation hints."""
|
||||||
|
return {
|
||||||
|
"has_multi_rate_tva": False,
|
||||||
|
"card_equals_total": True,
|
||||||
|
"has_client_cui": False,
|
||||||
|
"has_efactura": False,
|
||||||
|
"is_non_vat_payer": False,
|
||||||
|
}
|
||||||
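The coded-first, simple-fallback flow used by this profile can be exercised in isolation. The sketch below reuses the three patterns from `TVA_PATTERNS` above; `parse_decimal` is a hypothetical replacement for the inherited `_parse_decimal`, which this diff does not show, and the sample receipt lines are made up for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List, Optional

CODED_PATTERNS = [
    r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
    r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
]
SIMPLE_PATTERN = r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)'


def parse_decimal(raw: str) -> Optional[Decimal]:
    """Hypothetical stand-in for the base class helper."""
    cleaned = raw.strip()
    if ',' in cleaned:
        cleaned = cleaned.replace('.', '').replace(',', '.')
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def extract_tva(text: str) -> List[dict]:
    entries, seen = [], set()
    # Pass 1: patterns that carry a rate code (A-D).
    for pattern in CODED_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            code, percent = match.group(1).upper(), int(match.group(2))
            amount = parse_decimal(match.group(3))
            if amount and amount > 0 and (code, percent) not in seen:
                entries.append({'code': code, 'percent': percent, 'amount': amount})
                seen.add((code, percent))
    # Pass 2: only if nothing was found, accept "TVA XX% YY,YY" and default to code 'A'.
    if not entries:
        for match in re.finditer(SIMPLE_PATTERN, text, re.IGNORECASE):
            amount = parse_decimal(match.group(2))
            if amount and amount > 0:
                entries.append({'code': 'A', 'percent': int(match.group(1)), 'amount': amount})
                break
    return entries


print(extract_tva("TVA A: 19% = 12,34"))
print(extract_tva("TVA 9% 3,21"))
```

Note that the first sample line matches both coded patterns; the `seen` set is what keeps it from being reported twice.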
111
backend/modules/data_entry/services/ocr/profiles/socar.py
Normal file
@@ -0,0 +1,111 @@
|
|||||||
|
"""
|
||||||
|
SOCAR Petroleum store profile for OCR extraction.
|
||||||
|
|
||||||
|
SOCAR receipts are similar to OMV - gas station with client CUI support.
|
||||||
|
Date format may use YYYY. MM. DD with spaces.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
from datetime import date
|
||||||
|
from decimal import Decimal, InvalidOperation
|
||||||
|
from typing import List, Dict, Any, Tuple, Optional
|
||||||
|
|
||||||
|
from .base import BaseStoreProfile
|
||||||
|
from . import ProfileRegistry
|
||||||
|
|
||||||
|
|
||||||
|
@ProfileRegistry.register
|
||||||
|
class SocarProfile(BaseStoreProfile):
|
||||||
|
"""
|
||||||
|
SOCAR PETROLEUM S.A. - standard TVA with client CUI.
|
||||||
|
|
||||||
|
Key characteristics:
|
||||||
|
- Standard TVA format (usually single rate)
|
||||||
|
- Includes client CUI on receipt (for business purchases)
|
||||||
|
- Similar format to OMV/Petrom
|
||||||
|
- Date format may use YYYY. MM. DD (with spaces)
|
||||||
|
"""
|
||||||
|
|
||||||
|
CUI_LIST = ["12546600"]
|
||||||
|
NAME_PATTERNS = ["SOCAR", "S0CAR", "SOCAR PETROLEUM"] # OCR variants
|
||||||
|
STORE_NAME = "SOCAR PETROLEUM S.A."
|
||||||
|
|
||||||
|
# Standard TVA patterns for gas stations
|
||||||
|
TVA_PATTERNS = [
|
||||||
|
# Table format: "A-19,00% 285,66 49,58"
|
||||||
|
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]\d{2}\s*%\s+([\d.,]+)\s+([\d.,]+)',
|
||||||
|
# Simple format: "TVA 19% 49,58"
|
||||||
|
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||||
|
]
|
||||||
|
|
||||||
|
# Gas stations may use YYYY. MM. DD format
|
||||||
|
DATE_PATTERNS_OCR_SPACES = [
|
||||||
|
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
|
||||||
|
(r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
|
||||||
|
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
|
||||||
|
(r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
|
||||||
|
]
|
||||||
|
|
||||||
|
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||||
|
"""
|
||||||
|
Extract SOCAR-specific TVA entries.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Raw OCR text from receipt
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of TVA entries with code, percent, and amount
|
||||||
|
"""
|
||||||
|
entries = []
|
||||||
|
seen = set()
|
||||||
|
|
||||||
|
# Try table format first
|
||||||
|
table_pattern = self.TVA_PATTERNS[0]
|
||||||
|
for match in re.finditer(table_pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
code = match.group(1).upper()
|
||||||
|
percent = int(match.group(2))
|
||||||
|
tva_amount = self._parse_decimal(match.group(4))
|
||||||
|
|
||||||
|
if tva_amount and tva_amount > 0:
|
||||||
|
entry_key = (code, percent)
|
||||||
|
if entry_key not in seen:
|
||||||
|
entries.append({
|
||||||
|
'code': code,
|
||||||
|
'percent': percent,
|
||||||
|
'amount': tva_amount
|
||||||
|
})
|
||||||
|
seen.add(entry_key)
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Fallback to simple format if no table entries found
|
||||||
|
if not entries:
|
||||||
|
simple_pattern = self.TVA_PATTERNS[1]
|
||||||
|
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
percent = int(match.group(1))
|
||||||
|
amount = self._parse_decimal(match.group(2))
|
||||||
|
|
||||||
|
if amount and amount > 0:
|
||||||
|
# Default to code 'A' for simple format
|
||||||
|
entries.append({
|
||||||
|
'code': 'A',
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
break # Only take first match for simple format
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
|
||||||
|
return entries
|
||||||
|
|
||||||
|
def get_validation_hints(self) -> Dict[str, Any]:
|
||||||
|
"""Return SOCAR-specific validation hints."""
|
||||||
|
return {
|
||||||
|
"has_multi_rate_tva": False,
|
||||||
|
"card_equals_total": False,
|
||||||
|
"has_client_cui": True,
|
||||||
|
"has_efactura": False,
|
||||||
|
"is_non_vat_payer": False,
|
||||||
|
}
|
||||||
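`DATE_PATTERNS_OCR_SPACES` stores `(regex, confidence, order)` tuples, but the code that consumes them lives in the base class and is not part of this section. The following is a minimal sketch of one plausible consumer, assuming patterns are tried in list order and the first valid calendar date wins; `extract_date` here is an illustrative function, not the base class method.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Copied from SocarProfile.DATE_PATTERNS_OCR_SPACES.
DATE_PATTERNS_OCR_SPACES = [
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})\s+\d{2}:\d{2}', 0.98, 'ymd'),
    (r'(\d{4})[.,]\s*(\d{2})[.,]\s*(\d{2})', 0.95, 'ymd'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})\s+\d{2}:\d{2}', 0.92, 'dmy'),
    (r'(\d{2})[.,]\s*(\d{2})[.,]\s*(\d{4})', 0.85, 'dmy'),
]


def extract_date(text: str) -> Tuple[Optional[date], float]:
    """Return (date, confidence) for the first pattern that yields a valid date."""
    for pattern, confidence, order in DATE_PATTERNS_OCR_SPACES:
        match = re.search(pattern, text)
        if not match:
            continue
        first, second, third = (int(group) for group in match.groups())
        if order == 'ymd':
            year, month, day = first, second, third
        else:  # 'dmy'
            year, month, day = third, second, first
        try:
            return date(year, month, day), confidence
        except ValueError:
            # Digits matched but do not form a real date (e.g. month 13).
            continue
    return None, 0.0


print(extract_date("2026. 01. 06 14:32"))
print(extract_date("06.01.2026"))
```

Listing the timestamped variants first gives them the higher confidence, since a date followed by `HH:MM` is less likely to be a stray number sequence.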
@@ -0,0 +1,112 @@
|
|||||||
|
"""
|
||||||
|
STEPOUT MARKET SRL store profile for OCR extraction.
|
||||||
|
|
||||||
|
Bookstore with reduced TVA rate (5% for books in Romania).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
from decimal import Decimal, InvalidOperation
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
|
||||||
|
from .base import BaseStoreProfile
|
||||||
|
from . import ProfileRegistry
|
||||||
|
|
||||||
|
|
||||||
|
@ProfileRegistry.register
|
||||||
|
class StepoutMarketProfile(BaseStoreProfile):
|
||||||
|
"""
|
||||||
|
STEPOUT MARKET SRL - reduced TVA rate profile (books).
|
||||||
|
|
||||||
|
Key characteristics:
|
||||||
|
- Reduced TVA rate: 5% for books (cărți) in Romania
|
||||||
|
- May also have standard rates for non-book items
|
||||||
|
- Patterns are flexible to accept ANY TVA rate
|
||||||
|
- CARD payment typical
|
||||||
|
"""
|
||||||
|
|
||||||
|
CUI_LIST = ["35532655"]
|
||||||
|
NAME_PATTERNS = ["STEPOUT", "STEPOUT MARKET", "STEP0UT", "STEPOUT MARKET SRL"]
|
||||||
|
STORE_NAME = "STEPOUT MARKET SRL"
|
||||||
|
|
||||||
|
# TVA patterns (flexible - accepts any rate including 5%)
|
||||||
|
TVA_PATTERNS = [
|
||||||
|
# "TVA A: 5% = YY,YY" or "TVA-A 5% YY,YY" (coded format)
|
||||||
|
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||||
|
# "A - 5,00% = YY,YY" (table format)
|
||||||
|
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||||
|
# "TVA 5% YY,YY" (simple format - common for single rate)
|
||||||
|
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||||
|
# "TVA 5,00%: YY,YY" (percent with colon)
|
||||||
|
r'TVA\s+(\d{1,2})[.,]\d{2}\s*%\s*:?\s*([\d.,]+)',
|
||||||
|
]
|
||||||
|
|
||||||
|
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||||
|
"""
|
||||||
|
Extract TVA entries from receipt text.
|
||||||
|
|
||||||
|
Stepout Market primarily sells books which have 5% TVA in Romania.
|
||||||
|
The patterns are generic and will extract whatever rate is on the receipt.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Raw OCR text from receipt
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of TVA entries with code, percent, and amount
|
||||||
|
"""
|
||||||
|
entries = []
|
||||||
|
seen = set()
|
||||||
|
|
||||||
|
# Try coded patterns first (have code letter)
|
||||||
|
for pattern in self.TVA_PATTERNS[:2]:
|
||||||
|
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
code = match.group(1).upper()
|
||||||
|
percent = int(match.group(2))
|
||||||
|
amount = self._parse_decimal(match.group(3))
|
||||||
|
|
||||||
|
if amount and amount > 0:
|
||||||
|
entry_key = (code, percent)
|
||||||
|
if entry_key not in seen:
|
||||||
|
entries.append({
|
||||||
|
'code': code,
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
seen.add(entry_key)
|
||||||
|
except (ValueError, InvalidOperation, IndexError):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Fallback to simple format (no code letter, just percent + amount)
|
||||||
|
if not entries:
|
||||||
|
for pattern in self.TVA_PATTERNS[2:]:
|
||||||
|
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
percent = int(match.group(1))
|
||||||
|
amount = self._parse_decimal(match.group(2))
|
||||||
|
|
||||||
|
if amount and amount > 0:
|
||||||
|
# Default to code 'A' for simple format
|
||||||
|
entries.append({
|
||||||
|
'code': 'A',
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
break # Only take first match for simple format
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
if entries:
|
||||||
|
break
|
||||||
|
|
||||||
|
return entries
|
||||||
|
|
||||||
|
def get_validation_hints(self) -> Dict[str, Any]:
|
||||||
|
"""Return STEPOUT MARKET-specific validation hints."""
|
||||||
|
return {
|
||||||
|
"has_multi_rate_tva": False,
|
||||||
|
"card_equals_total": True,
|
||||||
|
"has_client_cui": True, # May have client CUI
|
||||||
|
"has_efactura": False,
|
||||||
|
"is_non_vat_payer": False,
|
||||||
|
"typical_tva_rate": 5, # Books have 5% TVA in Romania
|
||||||
|
"product_category": "books",
|
||||||
|
}
|
||||||
@@ -0,0 +1,103 @@
|
|||||||
|
"""
|
||||||
|
UNLIMITED KEYS S.R.L. store profile for OCR extraction.
|
||||||
|
|
||||||
|
Key duplication service. Notable for CASH (NUMERAR) payments.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import re
|
||||||
|
from decimal import Decimal, InvalidOperation
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
|
||||||
|
from .base import BaseStoreProfile
|
||||||
|
from . import ProfileRegistry
|
||||||
|
|
||||||
|
|
||||||
|
@ProfileRegistry.register
|
||||||
|
class UnlimitedKeysProfile(BaseStoreProfile):
|
||||||
|
"""
|
||||||
|
UNLIMITED KEYS S.R.L. - standard TVA profile with NUMERAR payment.
|
||||||
|
|
||||||
|
Key characteristics:
|
||||||
|
- Standard TVA format (single rate, any percentage)
|
||||||
|
- Key duplication service
|
||||||
|
- NUMERAR (cash) payment common - different from most stores!
|
||||||
|
- May also accept CARD
|
||||||
|
"""
|
||||||
|
|
||||||
|
CUI_LIST = ["18993187"]
|
||||||
|
NAME_PATTERNS = ["UNLIMITED KEYS", "UNLIMITED", "UNL1MITED", "UNLIMITED KEYS SRL"]
|
||||||
|
STORE_NAME = "UNLIMITED KEYS S.R.L."
|
||||||
|
|
||||||
|
# Standard TVA patterns (flexible - accepts any rate)
|
||||||
|
TVA_PATTERNS = [
|
||||||
|
# "TVA A: XX% = YY,YY" or "TVA-A XX% YY,YY"
|
||||||
|
r'TVA\s*[-:]?\s*([A-D])\s*:?\s*(\d{1,2})\s*%\s*[=:]?\s*([\d.,]+)',
|
||||||
|
# "A - XX,XX% = YY,YY"
|
||||||
|
r'([A-D])\s*[-:]\s*(\d{1,2})[.,]?\d{0,2}\s*%\s*[=:]?\s*([\d.,]+)',
|
||||||
|
# "TVA XX% YY,YY" (simple format without code)
|
||||||
|
r'TVA\s+(\d{1,2})\s*%\s*([\d.,]+)',
|
||||||
|
]
|
||||||
|
|
||||||
|
def extract_tva_entries(self, text: str) -> List[dict]:
|
||||||
|
"""
|
||||||
|
Extract TVA entries from receipt text.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
text: Raw OCR text from receipt
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of TVA entries with code, percent, and amount
|
||||||
|
"""
|
||||||
|
entries = []
|
||||||
|
seen = set()
|
||||||
|
|
||||||
|
# Try coded patterns first
|
||||||
|
for pattern in self.TVA_PATTERNS[:2]:
|
||||||
|
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
code = match.group(1).upper()
|
||||||
|
percent = int(match.group(2))
|
||||||
|
amount = self._parse_decimal(match.group(3))
|
||||||
|
|
||||||
|
if amount and amount > 0:
|
||||||
|
entry_key = (code, percent)
|
||||||
|
if entry_key not in seen:
|
||||||
|
entries.append({
|
||||||
|
'code': code,
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
seen.add(entry_key)
|
||||||
|
except (ValueError, InvalidOperation, IndexError):
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Fallback to simple format
|
||||||
|
if not entries:
|
||||||
|
simple_pattern = self.TVA_PATTERNS[2]
|
||||||
|
for match in re.finditer(simple_pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
percent = int(match.group(1))
|
||||||
|
amount = self._parse_decimal(match.group(2))
|
||||||
|
|
||||||
|
if amount and amount > 0:
|
||||||
|
entries.append({
|
||||||
|
'code': 'A',
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
break
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
|
||||||
|
return entries
|
||||||
|
|
||||||
|
def get_validation_hints(self) -> Dict[str, Any]:
|
||||||
|
"""Return UNLIMITED KEYS-specific validation hints."""
|
||||||
|
return {
|
||||||
|
"has_multi_rate_tva": False,
|
||||||
|
"card_equals_total": False, # May be NUMERAR (cash)
|
||||||
|
"has_client_cui": True, # May have client CUI
|
||||||
|
"has_efactura": False,
|
||||||
|
"is_non_vat_payer": False,
|
||||||
|
"common_payment": "NUMERAR", # Cash payments common
|
||||||
|
}
|
||||||
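Every profile above is attached with `@ProfileRegistry.register` and declares a `CUI_LIST`, and the extractor changes that follow call `ProfileRegistry.get_profile(cui)`. The registry itself is not part of this section, so the sketch below is only a guess at the general shape: a class-level dict keyed by CUI, filled by the decorator. `MiniRegistry`, `MiniProfile`, and the stripping of an `RO` prefix are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Type


class MiniProfile:
    """Hypothetical base class; real profiles derive from BaseStoreProfile."""
    CUI_LIST: List[str] = []


class MiniRegistry:
    """Hypothetical registry mapping CUI strings to profile classes."""
    _by_cui: Dict[str, Type[MiniProfile]] = {}

    @classmethod
    def register(cls, profile_cls: Type[MiniProfile]) -> Type[MiniProfile]:
        # Used as a decorator: index the class under each CUI, return it unchanged.
        for cui in profile_cls.CUI_LIST:
            cls._by_cui[cui] = profile_cls
        return profile_cls

    @classmethod
    def get_profile(cls, cui: Optional[str]) -> Optional[MiniProfile]:
        if not cui:
            return None
        key = cui.strip().upper()
        if key.startswith('RO'):
            key = key[2:].strip()  # assumed: "RO39634534" and "39634534" are equivalent
        profile_cls = cls._by_cui.get(key)
        return profile_cls() if profile_cls else None


@MiniRegistry.register
class DemoProfile(MiniProfile):
    CUI_LIST = ["39634534"]


profile = MiniRegistry.get_profile("RO39634534")
print(profile.__class__.__name__ if profile else None)
```

Returning an instance rather than the class matches how the extractor logs `store_profile.__class__.__name__`, though the real registry may cache instances or behave differently.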
@@ -7,6 +7,7 @@ from typing import Optional, Tuple, List
|
|||||||
from dataclasses import dataclass, field
|
from dataclasses import dataclass, field
|
||||||
|
|
||||||
from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
|
from backend.modules.data_entry.services.ocr.validation import OCRValidationEngine
|
||||||
|
from backend.modules.data_entry.services.ocr.profiles import ProfileRegistry
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
@@ -63,6 +64,57 @@ class ExtractionResult:
|
|||||||
class ReceiptExtractor:
|
class ReceiptExtractor:
|
||||||
"""Extract receipt fields using pattern matching for Romanian receipts."""
|
"""Extract receipt fields using pattern matching for Romanian receipts."""
|
||||||
|
|
||||||
|
# =========================================================================
|
||||||
|
# DEPRECATED: STORE_PROFILES dict - USE ProfileRegistry INSTEAD
|
||||||
|
# =========================================================================
|
||||||
|
# Store profiles are now managed by ProfileRegistry in:
|
||||||
|
# backend/modules/data_entry/services/ocr/profiles/
|
||||||
|
#
|
||||||
|
# This dict is kept for reference only. All extraction logic now uses:
|
||||||
|
# ProfileRegistry.get_profile(cui)
|
||||||
|
#
|
||||||
|
# See: backend/modules/data_entry/services/ocr/profiles/README.md
|
||||||
|
# =========================================================================
|
||||||
|
STORE_PROFILES = {
|
||||||
|
# Lidl - multi-rate TVA (A+B), specific format without hyphen/colon
|
||||||
|
"22891860": {
|
||||||
|
"name": "LIDL DISCOUNT S.R.L.",
|
||||||
|
"tva_pattern": "lidl",
|
||||||
|
"tva_format": "TVA {code} {percent}% {amount}",
|
||||||
|
"has_multi_rate_tva": True,
|
||||||
|
"card_equals_total": True,
|
||||||
|
},
|
||||||
|
# OMV Petrom - single TVA rate, client CUI included
|
||||||
|
"11201891": {
|
||||||
|
"name": "OMV PETROM MARKETING S.R.L.",
|
||||||
|
"tva_pattern": "standard",
|
||||||
|
"has_client_cui": True,
|
||||||
|
},
|
||||||
|
# FIVE-HOLDING (BRICK) - standard format
|
||||||
|
"10562600": {
|
||||||
|
"name": "FIVE-HOLDING S.A.",
|
||||||
|
"tva_pattern": "standard",
|
||||||
|
},
|
||||||
|
# Dedeman - e-factura format
|
||||||
|
"2816464": {
|
||||||
|
"name": "DEDEMAN SRL",
|
||||||
|
"tva_pattern": "standard",
|
||||||
|
"has_efactura": True,
|
||||||
|
},
|
||||||
|
# SOCAR Petroleum
|
||||||
|
"12546600": {
|
||||||
|
"name": "SOCAR PETROLEUM S.A.",
|
||||||
|
"tva_pattern": "standard",
|
||||||
|
"has_client_cui": True,
|
||||||
|
},
|
||||||
|
# Kineterra - non-VAT payer
|
||||||
|
"31180432": {
|
||||||
|
"name": "KINETERRA CONCEPT SRL",
|
||||||
|
"tva_pattern": "none",
|
||||||
|
"is_non_vat_payer": True,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
# Total amount patterns (most specific first)
|
# Total amount patterns (most specific first)
|
||||||
# Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc.
|
# Romanian receipts use various formats: TOTAL LEI, TOTAL:, TOTAL RON, etc.
|
||||||
# OCR often produces errors, so patterns must be tolerant
|
# OCR often produces errors, so patterns must be tolerant
|
||||||
@@ -394,48 +446,101 @@ class ReceiptExtractor:
|
|||||||
result.raw_text = text
|
result.raw_text = text
|
||||||
text_upper = text.upper()
|
text_upper = text.upper()
|
||||||
|
|
||||||
# Extract core fields
|
# =========================================================================
|
||||||
result.amount, result.confidence_amount = self._extract_amount(text_upper)
|
# STEP 1: Extract vendor info FIRST to find store profile
|
||||||
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
|
# =========================================================================
|
||||||
result.receipt_number, _ = self._extract_number(text_upper)
|
|
||||||
result.receipt_series, _ = self._extract_series(text_upper)
|
|
||||||
result.partner_name, result.confidence_vendor = self._extract_vendor(text)
|
result.partner_name, result.confidence_vendor = self._extract_vendor(text)
|
||||||
result.cui, _ = self._extract_cui(text_upper, text)
|
result.cui, _ = self._extract_cui(text_upper, text)
|
||||||
# Normalize CUI: fix R0 → RO OCR error and validate format
|
|
||||||
result.cui = OCRValidationEngine.normalize_cui(result.cui)
|
result.cui = OCRValidationEngine.normalize_cui(result.cui)
|
||||||
|
|
||||||
# Extract additional fields - Multiple TVA entries
|
# Lookup store-specific profile for enhanced extraction accuracy
|
||||||
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
|
store_profile = ProfileRegistry.get_profile(result.cui) if result.cui else None
|
||||||
|
if store_profile:
|
||||||
|
print(f"[Profile] Using {store_profile.__class__.__name__} for CUI {result.cui}", flush=True)
|
||||||
|
|
||||||
|
# =========================================================================
|
||||||
|
# STEP 2: Extract ALL fields using profile (if available) or generic
|
||||||
|
# =========================================================================
|
||||||
|
if store_profile:
|
||||||
|
# Profile-specific extraction (higher accuracy for known stores)
|
||||||
|
result.amount, result.confidence_amount = store_profile.extract_total(text_upper)
|
||||||
|
result.receipt_date, result.confidence_date = store_profile.extract_date(text_upper)
|
||||||
|
result.receipt_number, _ = store_profile.extract_receipt_number(text_upper)
|
||||||
|
result.tva_entries = store_profile.extract_tva_entries(text_upper)
|
||||||
|
result.tva_total = sum(e['amount'] for e in result.tva_entries) if result.tva_entries else None
|
||||||
|
result.payment_methods = store_profile.extract_payment_methods(text_upper)
|
||||||
|
|
||||||
|
# Client data extraction via profile (CUI + name)
|
||||||
|
profile_client_cui, cui_confidence = store_profile.extract_client_cui(text_upper)
|
||||||
|
profile_client_name, name_confidence = store_profile.extract_client_name(text)
|
||||||
|
|
||||||
|
if profile_client_cui or profile_client_name:
|
||||||
|
# Use profile extraction results
|
||||||
|
result.client_cui = OCRValidationEngine.normalize_cui(profile_client_cui) if profile_client_cui else None
|
||||||
|
result.client_name = profile_client_name
|
||||||
|
result.confidence_client = max(cui_confidence, name_confidence)
|
||||||
|
# Address still via generic (no profile method)
|
||||||
|
_, _, client_address, _ = self._extract_client_data(text_upper, text)
|
||||||
|
result.client_address = client_address
|
||||||
|
else:
|
||||||
|
# Fallback to generic client extraction
|
||||||
|
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
|
||||||
|
result.client_name = client_name
|
||||||
|
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
|
||||||
|
result.client_address = client_address
|
||||||
|
result.confidence_client = confidence
|
||||||
|
|
||||||
|
print(f"[Profile] Extracted: total={result.amount}, date={result.receipt_date}, "
|
||||||
|
f"TVA entries={len(result.tva_entries)}, payments={len(result.payment_methods)}", flush=True)
|
||||||
|
else:
|
||||||
|
# Generic extraction for unknown stores
|
||||||
|
result.amount, result.confidence_amount = self._extract_amount(text_upper)
|
||||||
|
result.receipt_date, result.confidence_date = self._extract_date(text_upper)
|
||||||
|
result.receipt_number, _ = self._extract_number(text_upper)
|
||||||
|
result.tva_entries, result.tva_total = self._extract_tva_entries(text_upper)
|
||||||
|
result.payment_methods = self._extract_payment_methods(text_upper)
|
||||||
|
|
||||||
|
# Generic client extraction
|
||||||
|
client_name, client_cui, client_address, confidence = self._extract_client_data(text_upper, text)
|
||||||
|
result.client_name = client_name
|
||||||
|
result.client_cui = OCRValidationEngine.normalize_cui(client_cui)
|
||||||
|
result.client_address = client_address
|
||||||
|
result.confidence_client = confidence
|
||||||
|
|
||||||
|
# Series extraction (no profile method, always generic)
|
||||||
|
result.receipt_series, _ = self._extract_series(text_upper)
|
||||||
|
|
||||||
|
# =========================================================================
|
||||||
|
# STEP 3: Debug logging and validation
|
||||||
|
# =========================================================================
|
||||||
if not result.tva_entries:
|
if not result.tva_entries:
|
||||||
print(f"[TVA Debug] No TVA found. Checking patterns...", flush=True)
|
print(f"[TVA Debug] No TVA found. Checking patterns...", flush=True)
|
||||||
# Debug: show what patterns see
|
|
||||||
normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper)
|
normalized = re.sub(r'(\d+)[.,]\s+(\d{2})', r'\1.\2', text_upper)
|
||||||
taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
taxe_match = re.search(r'T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
||||||
rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
rev_match = re.search(r'([\d.,]+)\s*T?OTAL\s+TAXE', normalized, re.IGNORECASE)
|
||||||
print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True)
|
print(f"[TVA Debug] 'OTAL TAXE' found: {bool(taxe_match)}, reversed: {rev_match.group(1) if rev_match else None}", flush=True)
|
||||||
|
|
||||||
# Log TVA vs TOTAL for debugging (validation happens in ocr_service._final_validation)
|
# Log TVA vs TOTAL for debugging
|
||||||
# NOTE: We NO LONGER clear TVA here - the service will recalculate TOTAL from TVA if needed
|
|
||||||
if result.tva_total and result.amount:
|
if result.tva_total and result.amount:
|
||||||
if result.tva_total > result.amount:
|
if result.tva_total > result.amount:
|
||||||
print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True)
|
print(f"[TVA Extraction] TVA ({result.tva_total}) > TOTAL ({result.amount}) - will be corrected in final validation", flush=True)
|
||||||
elif result.tva_total > result.amount * Decimal('0.5'):
|
elif result.tva_total > result.amount * Decimal('0.5'):
|
||||||
print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True)
|
print(f"[TVA Extraction] Warning: TVA ({result.tva_total}) is > 50% of TOTAL ({result.amount}) - suspicious", flush=True)
|
||||||
|
|
||||||
|
# Additional generic extractions
|
||||||
result.items_count = self._extract_items_count(text_upper)
|
result.items_count = self._extract_items_count(text_upper)
|
||||||
result.address = self._extract_address(text_upper)
|
result.address = self._extract_address(text_upper)
|
||||||
result.payment_methods = self._extract_payment_methods(text_upper)
|
|
||||||
|
|
||||||
# Validate payment methods against extracted amount
|
# =========================================================================
|
||||||
# If payment sum >> amount, clear invalid payments (likely OCR error)
|
# STEP 4: Validate and post-process
|
||||||
|
# =========================================================================
|
||||||
# Save original payment methods before validation (for payment mode detection)
|
# Save original payment methods before validation (for payment mode detection)
|
||||||
original_payment_methods = result.payment_methods.copy() if result.payment_methods else []
|
original_payment_methods = result.payment_methods.copy() if result.payment_methods else []
|
||||||
|
|
||||||
|
# Validate payment methods against extracted amount
|
||||||
result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount)
|
result.payment_methods = self._validate_payment_methods(result.payment_methods, result.amount)
|
||||||
|
|
||||||
# Auto-suggest payment_mode based on detected payment methods
|
# Auto-suggest payment_mode based on detected payment methods
|
||||||
# Use ORIGINAL payment_methods to detect CARD even if validation cleared them
|
|
||||||
# (e.g., CARD 318.16 is valid even if total validation failed)
|
|
||||||
payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods
|
payment_methods_for_mode = result.payment_methods if result.payment_methods else original_payment_methods
|
||||||
if payment_methods_for_mode:
|
if payment_methods_for_mode:
|
||||||
card_amount = sum(
|
card_amount = sum(
|
||||||
@@ -447,17 +552,9 @@ class ReceiptExtractor:
|
|||||||
result.suggested_payment_mode = 'banca'
|
result.suggested_payment_mode = 'banca'
|
||||||
print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True)
|
print(f"[Payment Mode] CARD detected ({card_amount}), suggesting 'banca'", flush=True)
|
||||||
else:
|
else:
|
||||||
# Only cash payments detected
|
|
||||||
result.suggested_payment_mode = 'numerar'
|
result.suggested_payment_mode = 'numerar'
|
||||||
print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True)
|
print(f"[Payment Mode] Cash only detected, suggesting 'numerar'", flush=True)
|
||||||
|
|
||||||
# Extract client data (B2B receipts)
|
|
||||||
client_name, client_cui, client_address, confidence_client = self._extract_client_data(text_upper, text)
|
|
||||||
result.client_name = client_name
|
|
||||||
result.client_cui = OCRValidationEngine.normalize_cui(client_cui) # Fix R0 → RO OCR error
|
|
||||||
result.client_address = client_address
|
|
||||||
result.confidence_client = confidence_client
|
|
||||||
|
|
||||||
# Detect receipt type
|
# Detect receipt type
|
||||||
result.receipt_type = self._detect_receipt_type(text_upper)
|
result.receipt_type = self._detect_receipt_type(text_upper)
|
||||||
|
|
||||||
@@ -620,6 +717,40 @@ class ReceiptExtractor:
|
|||||||
|
|
||||||
return num_str
|
return num_str
|
||||||
|
|
||||||
|
def _calculate_multi_rate_tva_total(self, tva_entries: List[dict]) -> Optional[Decimal]:
|
||||||
|
"""
|
||||||
|
Calculate implied total from ALL TVA entries (multi-rate support).
|
||||||
|
|
||||||
|
Formula for each entry: total_for_entry = tva * (100 + rate) / rate
|
||||||
|
Final total = sum of all entry totals
|
||||||
|
|
||||||
|
Example for Lidl (TVA A 21% = 7.71, TVA B 11% = 2.13):
|
||||||
|
Entry A: 7.71 * 121 / 21 = 44.42
|
||||||
|
Entry B: 2.13 * 111 / 11 = 21.49
|
||||||
|
Total: 44.42 + 21.49 = 65.91 ≈ 65.86 (within tolerance)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Implied total Decimal, or None if calculation not possible
|
||||||
|
"""
|
||||||
|
if not tva_entries:
|
||||||
|
return None
|
||||||
|
|
||||||
|
total = Decimal('0')
|
||||||
|
for entry in tva_entries:
|
||||||
|
rate = entry.get('percent', 0)
|
||||||
|
tva_amount = entry.get('amount')
|
||||||
|
if tva_amount and rate > 0:
|
||||||
|
try:
|
||||||
|
tva_dec = Decimal(str(tva_amount))
|
||||||
|
# Formula: total_for_entry = tva * (100 + rate) / rate
|
||||||
|
entry_total = tva_dec * Decimal(100 + rate) / Decimal(rate)
|
||||||
|
total += entry_total
|
||||||
|
print(f"[Multi-rate TVA] Entry {entry.get('code', '?')}: tva={tva_amount}, rate={rate}% -> implied={entry_total:.2f}", flush=True)
|
||||||
|
except (InvalidOperation, ValueError, TypeError):
|
||||||
|
continue
|
||||||
|
|
||||||
|
return total.quantize(Decimal('0.01')) if total > 0 else None
|
||||||
|
|
||||||
def _cross_validate_and_calculate_amount(
|
def _cross_validate_and_calculate_amount(
|
||||||
self,
|
self,
|
||||||
amount: Optional[Decimal],
|
amount: Optional[Decimal],
|
||||||
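The multi-rate formula in `_calculate_multi_rate_tva_total` can be checked outside the class. The sketch below is a standalone approximation of that method without the logging; `implied_total` is an illustrative name, not one used in the codebase. Each entry contributes `tva * (100 + rate) / rate`, and the sum is rounded once at the end.

```python
from decimal import Decimal, InvalidOperation
from typing import List, Optional


def implied_total(tva_entries: List[dict]) -> Optional[Decimal]:
    """Sum the gross amount implied by each TVA entry; None if nothing usable."""
    total = Decimal('0')
    for entry in tva_entries:
        rate = entry.get('percent', 0)
        amount = entry.get('amount')
        if not amount or rate <= 0:
            continue
        try:
            tva = Decimal(str(amount))
        except (InvalidOperation, ValueError, TypeError):
            continue
        # Gross for this rate: net + tva, where net = tva * 100 / rate.
        total += tva * Decimal(100 + rate) / Decimal(rate)
    return total.quantize(Decimal('0.01')) if total > 0 else None


lidl = [
    {'code': 'A', 'percent': 21, 'amount': Decimal('7.71')},
    {'code': 'B', 'percent': 11, 'amount': Decimal('2.13')},
]
print(implied_total(lidl))
```

For the Lidl figures in the docstring (7.71 at 21% and 2.13 at 11%) this gives 65.92, because the unrounded per-entry values are summed before quantizing. Using only the first entry, as the removed code did, would imply roughly 44.42 and miss the second rate entirely.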
@@ -634,12 +765,11 @@ class ReceiptExtractor:
|
|||||||
Returns: (amount, confidence, source_description)
|
Returns: (amount, confidence, source_description)
|
||||||
|
|
||||||
Logic:
|
Logic:
|
||||||
1. If amount is valid (>0) with high confidence (>=0.8), use it directly
|
1. Collect all available sources: extracted amount, payment sum, TVA-implied total
|
||||||
2. Calculate payment_sum = CARD + NUMERAR + other methods
|
2. Find consensus: 2+ sources within 3% tolerance
|
||||||
3. Calculate tva_implied_total = tva_total * (100 + rate) / rate
|
3. If consensus found, use the higher-confidence source value
|
||||||
4. Cross-validate: if payment_sum matches extracted amount, boost confidence
|
4. If extracted differs >10% from all others, it's an outlier - correct it
|
||||||
5. If amount is 0/None, use payment_sum as total
|
5. If no consensus possible, fallback to individual validations
|
||||||
6. If payment_sum is 0, try to calculate from TVA
|
|
||||||
"""
|
"""
|
||||||
# Calculate payment methods sum
|
# Calculate payment methods sum
|
||||||
payment_sum = Decimal('0')
|
payment_sum = Decimal('0')
|
||||||
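The consensus rule described in the docstring (two or more sources agreeing within 3%) can be sketched as a small pure function. This is an approximation under stated assumptions: `find_consensus` is an illustrative name, sources are `(name, value, confidence)` tuples as in the hunk that follows, and returning on the first agreeing pair is assumed because the rest of the method is not visible in this section.

```python
from typing import List, Optional, Tuple

Source = Tuple[str, float, float]  # (name, value, confidence)


def find_consensus(sources: List[Source],
                   tolerance_pct: float = 3.0) -> Optional[Tuple[float, float]]:
    """Return (value, confidence) from the first pair of sources that agree."""
    for i, (_, val1, conf1) in enumerate(sources):
        for _, val2, conf2 in sources[i + 1:]:
            if val1 <= 0 or val2 <= 0:
                continue
            diff_pct = abs(val1 - val2) / max(val1, val2) * 100
            if diff_pct <= tolerance_pct:
                # Keep the value from the more trusted source, then boost it.
                value, conf = (val1, conf1) if conf1 >= conf2 else (val2, conf2)
                return value, min(0.98, conf + 0.05)
    return None


# The extracted total lost a digit to OCR; payment sum and TVA-implied total agree.
sources = [('extracted', 6.58, 0.90), ('payment', 65.86, 0.92), ('tva_calc', 65.92, 0.88)]
print(find_consensus(sources))
```

In the example the payment sum and the TVA-implied total differ by about 0.1%, so the payment value is chosen and the misread extracted amount is treated as the outlier.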
@@ -652,43 +782,73 @@ class ReceiptExtractor:
|
|||||||
except (InvalidOperation, ValueError, TypeError):
|
except (InvalidOperation, ValueError, TypeError):
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# Calculate TVA-implied total: total = tva * (100 + rate) / rate
|
# Calculate TVA-implied total using ALL entries (multi-rate fix)
|
||||||
tva_implied_total = None
|
tva_implied_total = self._calculate_multi_rate_tva_total(tva_entries)
|
||||||
if tva_entries:
|
|
||||||
# Use the main TVA entry (typically the largest or first one)
|
|
||||||
main_entry = tva_entries[0]
|
|
||||||
rate = main_entry.get('percent', 19)
|
|
||||||
tva_amount = main_entry.get('amount')
|
|
||||||
if tva_amount and rate > 0:
|
|
||||||
try:
|
|
||||||
tva_dec = Decimal(str(tva_amount))
|
|
||||||
# total = tva * (100 + rate) / rate
|
|
||||||
tva_implied_total = (tva_dec * Decimal(100 + rate) / Decimal(rate)).quantize(Decimal('0.01'))
|
|
||||||
except (InvalidOperation, ValueError, TypeError):
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Case 1: Amount is valid with high confidence - validate against TVA and payments
|
# Multi-source consensus approach (3% tolerance for multi-rate TVA rounding)
|
||||||
|
CONSENSUS_TOLERANCE = 3.0 # 3% tolerance
|
||||||
|
|
||||||
|
# Collect all available sources with their confidences
|
||||||
|
sources = []
|
||||||
|
if amount and amount > 0:
|
||||||
|
sources.append(('extracted', float(amount), confidence_amount))
|
||||||
|
if payment_sum > 0:
|
||||||
|
sources.append(('payment', float(payment_sum), 0.92)) # Payment is very reliable
|
||||||
|
if tva_implied_total and tva_implied_total > 0:
|
||||||
|
sources.append(('tva_calc', float(tva_implied_total), 0.88)) # TVA calc is reliable
|
||||||
|
|
||||||
|
print(f"[Cross-Validation] Sources: {[(s[0], f'{s[1]:.2f}', f'{s[2]:.2f}') for s in sources]}", flush=True)
|
||||||
|
|
||||||
|
# Find consensus: 2+ sources within tolerance
|
||||||
|
if len(sources) >= 2:
|
||||||
|
for i, (name1, val1, conf1) in enumerate(sources):
|
||||||
|
for name2, val2, conf2 in sources[i+1:]:
|
||||||
|
if val1 <= 0 or val2 <= 0:
|
||||||
|
continue
|
||||||
|
diff_pct = abs(val1 - val2) / max(val1, val2) * 100
|
||||||
|
if diff_pct <= CONSENSUS_TOLERANCE:
|
||||||
|
# Consensus found! Use value from higher-confidence source
|
||||||
|
if conf1 >= conf2:
|
||||||
|
consensus_val, consensus_conf = val1, conf1
|
||||||
|
else:
|
||||||
|
consensus_val, consensus_conf = val2, conf2
|
||||||
|
# Boost confidence for consensus
|
||||||
|
consensus_conf = min(0.98, consensus_conf + 0.05)
|
||||||
|
print(f"[Cross-Validation] Consensus: {name1}={val1:.2f} ≈ {name2}={val2:.2f} (diff={diff_pct:.1f}%)", flush=True)
|
||||||
|
return Decimal(str(round(consensus_val, 2))), consensus_conf, f"consensus ({name1}+{name2})"
|
||||||
|
|
||||||
|
# No consensus - check if extracted is an outlier (differs >10% from all others)
|
||||||
|
if amount and amount > 0 and len(sources) >= 2:
|
||||||
|
other_sources = [s for s in sources if s[0] != 'extracted']
|
||||||
|
if other_sources:
|
||||||
|
extracted_val = float(amount)
|
||||||
|
all_differ = all(
|
||||||
|
abs(extracted_val - s[1]) / max(extracted_val, s[1]) * 100 > 10
|
||||||
|
for s in other_sources if s[1] > 0
|
||||||
|
)
|
||||||
|
if all_differ:
|
||||||
|
# Extracted differs significantly from all others - use the best other source
|
||||||
|
best_other = max(other_sources, key=lambda s: s[2])
|
||||||
|
print(f"[Cross-Validation] Extracted outlier: {extracted_val:.2f} differs >10% from all others, using {best_other[0]}={best_other[1]:.2f}", flush=True)
|
||||||
|
return Decimal(str(round(best_other[1], 2))), best_other[2], f"corrected (extracted outlier, using {best_other[0]})"
|
||||||
|
|
||||||
|
# Fallback: Case 1 - Amount valid with high confidence
|
||||||
if amount and amount > 0 and confidence_amount >= 0.8:
|
if amount and amount > 0 and confidence_amount >= 0.8:
|
||||||
# First check TVA-implied total (most reliable when TVA is extracted correctly)
|
# Check TVA-implied total
|
||||||
if tva_implied_total and tva_implied_total > 0:
|
if tva_implied_total and tva_implied_total > 0:
|
||||||
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
||||||
if tva_diff_percent <= 1:
|
if tva_diff_percent <= 3:
|
||||||
# Near-perfect TVA match - highest confidence
|
|
||||||
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)"
|
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by TVA)"
|
||||||
elif tva_diff_percent > 10:
|
elif tva_diff_percent > 10:
|
||||||
# Significant mismatch - TVA-implied total is more reliable
|
|
||||||
# This catches cases where wrong TOTAL line was extracted (e.g., REST, SUBTOTAL)
|
|
||||||
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
||||||
return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)"
|
return tva_implied_total, 0.90, "calculated from TVA (extracted amount mismatch)"
|
||||||
|
|
||||||
# Cross-validate with payment methods
|
# Cross-validate with payment methods
|
||||||
if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'):
|
if payment_sum > 0 and abs(amount - payment_sum) <= Decimal('0.02'):
|
||||||
# Perfect match - boost confidence
|
|
||||||
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)"
|
return amount, min(0.98, confidence_amount + 0.05), "extracted (validated by payment methods)"
|
||||||
elif payment_sum > 0:
|
elif payment_sum > 0:
|
||||||
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
||||||
if payment_diff_percent > 10:
|
if payment_diff_percent > 10:
|
||||||
# Significant mismatch - payment sum is more reliable
|
|
||||||
print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True)
|
print(f"[Cross-Validation] Amount mismatch with payments: extracted={amount}, payments={payment_sum} (diff={payment_diff_percent:.1f}%)", flush=True)
|
||||||
return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)"
|
return payment_sum, 0.88, "calculated from payment methods (extracted amount mismatch)"
|
||||||
|
|
||||||
@@ -696,29 +856,22 @@ class ReceiptExtractor:
|
|||||||
|
|
||||||
# Case 2: Amount exists but low confidence - try to validate/correct
|
# Case 2: Amount exists but low confidence - try to validate/correct
|
||||||
if amount and amount > 0:
|
if amount and amount > 0:
|
||||||
# First check TVA-implied total (most reliable)
|
|
||||||
if tva_implied_total and tva_implied_total > 0:
|
if tva_implied_total and tva_implied_total > 0:
|
||||||
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
tva_diff_percent = abs(float(amount) - float(tva_implied_total)) / float(tva_implied_total) * 100
|
||||||
if tva_diff_percent <= 2:
|
if tva_diff_percent <= 3:
|
||||||
# Close match - boost confidence
|
|
||||||
return amount, 0.88, "extracted (validated by TVA)"
|
return amount, 0.88, "extracted (validated by TVA)"
|
||||||
elif tva_diff_percent > 10:
|
elif tva_diff_percent > 10:
|
||||||
# Significant mismatch - use TVA-implied total
|
|
||||||
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
print(f"[Cross-Validation] Amount mismatch with TVA: extracted={amount}, tva_implied={tva_implied_total} (diff={tva_diff_percent:.1f}%)", flush=True)
|
||||||
return tva_implied_total, 0.85, "calculated from TVA"
|
return tva_implied_total, 0.85, "calculated from TVA"
|
||||||
|
|
||||||
# Check if payment methods sum matches
|
|
||||||
if payment_sum > 0:
|
if payment_sum > 0:
|
||||||
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
payment_diff_percent = abs(float(amount) - float(payment_sum)) / float(payment_sum) * 100
|
||||||
if payment_diff_percent <= 0.5:
|
if payment_diff_percent <= 1:
|
||||||
# Close match - boost confidence
|
|
||||||
return amount, 0.90, "extracted (validated by payment methods)"
|
return amount, 0.90, "extracted (validated by payment methods)"
|
||||||
elif payment_diff_percent > 10:
|
elif payment_diff_percent > 10:
|
||||||
# Mismatch - prefer payment_sum as it's more reliable
|
|
||||||
print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True)
|
print(f"[Cross-Validation] Amount mismatch: extracted={amount}, payments={payment_sum}", flush=True)
|
||||||
return payment_sum, 0.85, "calculated from payment methods"
|
return payment_sum, 0.85, "calculated from payment methods"
|
||||||
|
|
||||||
# No validation possible - return as-is
|
|
||||||
return amount, confidence_amount, "extracted (unvalidated)"
|
return amount, confidence_amount, "extracted (unvalidated)"
|
||||||
|
|
||||||
# Case 3: Amount is 0 or None - calculate from payment methods
|
# Case 3: Amount is 0 or None - calculate from payment methods
|
||||||
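The consensus step of `_cross_validate_and_calculate_amount` can be read in isolation from the fallback cases. The sketch below restates it as a standalone function; `find_consensus` is a hypothetical name, and the sample values are illustrative rather than taken from a real receipt.

```python
from decimal import Decimal
from typing import List, Optional, Tuple

CONSENSUS_TOLERANCE = 3.0  # percent, same value as in the diff

Source = Tuple[str, float, float]  # (name, value, confidence)


def find_consensus(sources: List[Source]) -> Optional[Tuple[Decimal, float, str]]:
    """Return the first pair of sources that agree within the tolerance."""
    for i, (name1, val1, conf1) in enumerate(sources):
        for name2, val2, conf2 in sources[i + 1:]:
            if val1 <= 0 or val2 <= 0:
                continue
            diff_pct = abs(val1 - val2) / max(val1, val2) * 100
            if diff_pct <= CONSENSUS_TOLERANCE:
                # Keep the value from the more trusted source, then boost it.
                val, conf = (val1, conf1) if conf1 >= conf2 else (val2, conf2)
                boosted = min(0.98, conf + 0.05)
                return Decimal(str(round(val, 2))), boosted, f"consensus ({name1}+{name2})"
    return None


sources = [
    ('extracted', 12.50, 0.70),  # e.g. a SUBTOTAL line picked up by mistake
    ('payment', 65.90, 0.92),
    ('tva_calc', 65.92, 0.88),
]
result = find_consensus(sources)
```

Here the payment sum and the TVA-implied total agree to within a few hundredths of a percent, so the payment value wins on confidence and the extracted amount is ignored. Note that the first agreeing pair wins, so the order in which sources are appended matters when more than one pair is within tolerance.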
@@ -946,6 +1099,28 @@ class ReceiptExtractor:
|
|||||||
|
|
||||||
return name
|
return name
|
||||||
|
|
||||||
|
def _get_store_profile(self, cui: Optional[str]) -> Optional[dict]:
|
||||||
|
"""
|
||||||
|
Get store-specific profile by CUI.
|
||||||
|
|
||||||
|
DEPRECATED: Use ProfileRegistry.get_profile() directly for profile objects.
|
||||||
|
This method is kept for backward compatibility and returns validation hints dict.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
cui: The CUI extracted from receipt (with or without RO prefix)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Store profile validation hints dict or None if not found
|
||||||
|
"""
|
||||||
|
profile = ProfileRegistry.get_profile(cui)
|
||||||
|
if profile:
|
||||||
|
# Return validation hints for backward compatibility
|
||||||
|
hints = profile.get_validation_hints()
|
||||||
|
hints['name'] = profile.STORE_NAME
|
||||||
|
print(f"[Store Profile] Found profile for {cui}: {profile.STORE_NAME}", flush=True)
|
||||||
|
return hints
|
||||||
|
return None
|
||||||
|
|
||||||
def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]:
|
def _extract_cui(self, text_upper: str, original_text: str) -> Tuple[Optional[str], float]:
|
||||||
"""
|
"""
|
||||||
Extract vendor CUI (fiscal identification code) from text.
|
Extract vendor CUI (fiscal identification code) from text.
|
||||||
@@ -1020,11 +1195,114 @@ class ReceiptExtractor:
|
|||||||
# Default to bon_fiscal if neither found
|
# Default to bon_fiscal if neither found
|
||||||
return 'bon_fiscal'
|
return 'bon_fiscal'
|
||||||
|
|
||||||
|
def _try_pattern_lidl(self, text: str) -> List[dict]:
|
||||||
|
"""
|
||||||
|
Try Lidl-style TVA pattern: "TVA A 21,00% 7.71" (no hyphen/colon separator).
|
||||||
|
|
||||||
|
Lidl receipts format:
|
||||||
|
TOTAL TVA 9,84
|
||||||
|
TVA A 21,00% 7,71
|
||||||
|
TVA B 11,00% 2,13
|
||||||
|
|
||||||
|
Returns list of TVA entries found.
|
||||||
|
"""
|
||||||
|
entries = []
|
||||||
|
seen = set()
|
||||||
|
|
||||||
|
# Pattern: TVA/TUA/IVA + code (A-D) + percent + amount (on same line)
|
||||||
|
# Handles: "TVA A 21,00% 7,71", "TVA B 11,00% 2,13", "TUA A 21% 7.71"
|
||||||
|
lidl_patterns = [
|
||||||
|
# Same line: "TVA A 21,00% 7.71" (with various spacing)
|
||||||
|
r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||||
|
# Same line with backslash (OCR artifact): "TVA A \21,00% 7.71"
|
||||||
|
r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||||
|
# IVA variant
|
||||||
|
r'IVA\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
|
||||||
|
]
|
||||||
|
|
||||||
|
for pattern in lidl_patterns:
|
||||||
|
for match in re.finditer(pattern, text, re.IGNORECASE):
|
||||||
|
try:
|
||||||
|
code = match.group(1).upper()
|
||||||
|
percent = int(match.group(2))
|
||||||
|
amount_str = self._normalize_number(match.group(3))
|
||||||
|
amount = Decimal(amount_str)
|
||||||
|
|
||||||
|
if amount > 0:
|
||||||
|
entry_key = (code, percent)
|
||||||
|
if entry_key not in seen:
|
||||||
|
entries.append({
|
||||||
|
'code': code,
|
||||||
|
'percent': percent,
|
||||||
|
'amount': amount
|
||||||
|
})
|
||||||
|
seen.add(entry_key)
|
||||||
|
print(f"[TVA Lidl] Found: TVA {code} {percent}% = {amount}", flush=True)
|
||||||
|
except (ValueError, InvalidOperation):
|
||||||
|
continue
|
||||||
|
|
||||||
|
return entries
|
||||||
|
|
||||||
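To see what the Lidl-style pattern captures, the following sketch runs the backslash-tolerant regex from `_try_pattern_lidl` against the sample lines in its docstring. `normalize_number` is a simplified, hypothetical stand-in for `_normalize_number`, whose real behavior is not shown in this diff; it only swaps a decimal comma for a point.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import List

# Same regex as the second entry of lidl_patterns (optional OCR backslash).
LIDL_TVA = re.compile(
    r'T[VU][AR]\s+([A-D])\s+\\?(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)',
    re.IGNORECASE,
)


def normalize_number(raw: str) -> str:
    """Hypothetical simplification: treat a comma as the decimal separator."""
    return raw.strip().replace(',', '.')


def parse_lidl_tva(text: str) -> List[dict]:
    entries, seen = [], set()
    for match in LIDL_TVA.finditer(text):
        code = match.group(1).upper()
        percent = int(match.group(2))
        try:
            amount = Decimal(normalize_number(match.group(3)))
        except InvalidOperation:
            continue  # e.g. "1.234,56" is not handled by this simplified helper
        if amount > 0 and (code, percent) not in seen:
            seen.add((code, percent))
            entries.append({'code': code, 'percent': percent, 'amount': amount})
    return entries


sample = "TOTAL TVA 9,84\nTVA A 21,00% 7,71\nTVA B 11,00% 2,13"
parsed = parse_lidl_tva(sample)
```

The `TOTAL TVA 9,84` line is not matched because the pattern requires a rate code (A-D) directly after `TVA`. The two parsed amounts add up to the 9,84 shown on that line, which is the check the candidate selection relies on.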
|
def _select_best_tva_candidate(
|
||||||
|
self,
|
||||||
|
candidates: List[tuple],
|
||||||
|
tva_bon_total: Optional[Decimal]
|
||||||
|
) -> Tuple[List[dict], Optional[Decimal]]:
|
||||||
|
"""
|
||||||
|
Select the best TVA candidate from collected candidates.
|
||||||
|
|
||||||
|
Selection criteria (priority order):
|
||||||
|
1. Sum matches TOTAL TVA BON (highest priority)
|
||||||
|
2. More entries = better (for multi-rate receipts)
|
||||||
|
3. Pattern confidence as tiebreaker
|
||||||
|
|
||||||
|
Args:
|
||||||
|
candidates: List of (pattern_name, confidence, entries, sum)
|
||||||
|
tva_bon_total: Authoritative TOTAL TVA BON value (if extracted)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
(best_entries, best_sum)
|
||||||
|
"""
|
||||||
|
if not candidates:
|
||||||
|
return [], None
|
||||||
|
|
||||||
|
# Score each candidate
|
||||||
|
scored = []
|
||||||
|
for name, confidence, entries, sum_val in candidates:
|
||||||
|
score = 0.0
|
||||||
|
|
||||||
|
# Criterion 1: Sum matches TOTAL TVA BON (highest priority)
|
||||||
|
if tva_bon_total and sum_val:
|
||||||
|
tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02')) # 2% tolerance
|
||||||
|
if abs(sum_val - tva_bon_total) <= tolerance:
|
||||||
|
score += 100 # High bonus for matching authoritative total
|
||||||
|
print(f"[TVA Select] {name}: sum {sum_val} matches tva_bon_total {tva_bon_total}", flush=True)
|
||||||
|
|
||||||
|
# Criterion 2: More entries (for multi-rate receipts)
|
||||||
|
score += len(entries) * 10
|
||||||
|
|
||||||
|
# Criterion 3: Pattern confidence
|
||||||
|
score += confidence * 5
|
||||||
|
|
||||||
|
scored.append((score, name, confidence, entries, sum_val))
|
||||||
|
print(f"[TVA Select] Candidate {name}: score={score:.1f}, entries={len(entries)}, sum={sum_val}", flush=True)
|
||||||
|
|
||||||
|
# Sort by score descending
|
||||||
|
scored.sort(key=lambda x: x[0], reverse=True)
|
||||||
|
best = scored[0]
|
||||||
|
print(f"[TVA Select] Winner: {best[1]} (score={best[0]:.1f})", flush=True)
|
||||||
|
|
||||||
|
return best[3], best[4]
|
||||||
|
|
||||||
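The scoring weights in `_select_best_tva_candidate` are easiest to judge with a worked example. The sketch below isolates the per-candidate score in a hypothetical helper, `score_candidate`, and applies it to two illustrative candidates for the same receipt.

```python
from decimal import Decimal
from typing import List, Optional


def score_candidate(
    confidence: float,
    entries: List[dict],
    sum_val: Optional[Decimal],
    tva_bon_total: Optional[Decimal],
) -> float:
    """Score one candidate with the same weights as the diff (sketch)."""
    score = 0.0
    if tva_bon_total and sum_val:
        # 2% tolerance, but never tighter than 0.02 in absolute terms
        tolerance = max(Decimal('0.02'), tva_bon_total * Decimal('0.02'))
        if abs(sum_val - tva_bon_total) <= tolerance:
            score += 100
    score += len(entries) * 10
    score += confidence * 5
    return score


tva_bon_total = Decimal('9.84')
lidl_score = score_candidate(0.96, [{'code': 'A'}, {'code': 'B'}], Decimal('9.84'), tva_bon_total)
standard_score = score_candidate(0.90, [{'code': 'A'}], Decimal('7.71'), tva_bon_total)
```

With these inputs the Lidl candidate scores 100 + 20 + 4.8 and the standard candidate 10 + 4.5, so a match against TOTAL TVA BON dominates. Without an authoritative total, a candidate needs only one extra entry (10 points) to outweigh any confidence difference (at most 5 points), which is worth keeping in mind if a pattern over-matches.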
def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]:
|
def _extract_tva_entries(self, text: str) -> Tuple[List[dict], Optional[Decimal]]:
|
||||||
"""
|
"""
|
||||||
Extract multiple TVA (VAT) entries from text.
|
Extract multiple TVA (VAT) entries from text.
|
||||||
Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%).
|
Romanian receipts can have multiple TVA rates (A=19%, B=9%, C=5%, D=0%).
|
||||||
|
|
||||||
|
Uses CANDIDATE COLLECTION approach:
|
||||||
|
- Try ALL patterns and collect candidates
|
||||||
|
- Select best candidate based on matching TOTAL TVA BON
|
||||||
|
|
||||||
Returns (tva_entries, tva_total) where tva_entries is a list of:
|
Returns (tva_entries, tva_total) where tva_entries is a list of:
|
||||||
{'code': 'A', 'percent': 19, 'amount': Decimal('15.20')}
|
{'code': 'A', 'percent': 19, 'amount': Decimal('15.20')}
|
||||||
"""
|
"""
|
||||||
@@ -1054,6 +1332,22 @@ class ReceiptExtractor:
|
|||||||
# Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%")
|
# Also normalize comma followed by space to comma (for "21, 00%" -> "21,00%")
|
||||||
normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text)
|
normalized_text = re.sub(r'(\d+),\s+(\d{2})\s*%', r'\1.\2%', normalized_text)
|
||||||
|
|
||||||
|
# Extract TOTAL TVA BON/TOTAL TVA first as the authoritative reference
|
||||||
|
tva_bon_total = self._extract_total_tva_bon(normalized_text)
|
||||||
|
print(f"[TVA Debug] TOTAL TVA BON: {tva_bon_total}", flush=True)
|
||||||
|
|
||||||
|
# CANDIDATE COLLECTION APPROACH: Try all patterns, collect candidates, select best
|
||||||
|
all_candidates = [] # List of (pattern_name, confidence, entries, sum)
|
||||||
|
|
||||||
|
# === LIDL-STYLE PATTERNS (NEW) ===
|
||||||
|
# Lidl format: "TVA A 21,00% 7.71" or "TVA B 11,00% 2.13" (no hyphen/colon)
|
||||||
|
# This pattern handles multi-rate TVA receipts
|
||||||
|
lidl_entries = self._try_pattern_lidl(normalized_text)
|
||||||
|
if lidl_entries:
|
||||||
|
lidl_sum = sum(e['amount'] for e in lidl_entries)
|
||||||
|
all_candidates.append(('lidl', 0.96, lidl_entries, lidl_sum))
|
||||||
|
print(f"[TVA Debug] Lidl pattern: {len(lidl_entries)} entries, sum={lidl_sum}", flush=True)
|
||||||
|
|
||||||
# Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable
|
# Pattern 0a: First try to get TVA from "TOTAL TAXE:" which is most reliable
|
||||||
# Format: "TOTAL TAXE: 55,22" - this is always the TVA amount
|
# Format: "TOTAL TAXE: 55,22" - this is always the TVA amount
|
||||||
# OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:"
|
# OCR may cut "T" producing "OTAL TAXE:" instead of "TOTAL TAXE:"
|
||||||
@@ -1372,10 +1666,21 @@ class ReceiptExtractor:
|
|||||||
except (ValueError, InvalidOperation):
|
except (ValueError, InvalidOperation):
|
||||||
continue
|
continue
|
||||||
|
|
||||||
# Extract TOTAL TVA BON as reference (separate from individual entries)
|
# Add existing extraction results to candidates (if any)
|
||||||
tva_bon_total = self._extract_total_tva_bon(normalized_text)
|
if tva_entries:
|
||||||
|
entries_sum = sum(entry['amount'] for entry in tva_entries)
|
||||||
|
all_candidates.append(('standard', 0.90, tva_entries, entries_sum))
|
||||||
|
print(f"[TVA Debug] Standard patterns: {len(tva_entries)} entries, sum={entries_sum}", flush=True)
|
||||||
|
|
||||||
# Calculate sum from entries
|
# === CANDIDATE SELECTION ===
|
||||||
|
# Select best candidate using TOTAL TVA BON as authoritative reference
|
||||||
|
if all_candidates:
|
||||||
|
best_entries, best_sum = self._select_best_tva_candidate(all_candidates, tva_bon_total)
|
||||||
|
if best_entries:
|
||||||
|
tva_entries = best_entries
|
||||||
|
entries_sum = best_sum
|
||||||
|
|
||||||
|
# Recalculate sum from the final tva_entries (the reset below discards any sum set by candidate selection)
|
||||||
entries_sum = None
|
entries_sum = None
|
||||||
if tva_entries:
|
if tva_entries:
|
||||||
entries_sum = sum(entry['amount'] for entry in tva_entries)
|
entries_sum = sum(entry['amount'] for entry in tva_entries)
|
||||||
|
|||||||
600
scripts/generate_store_profile.py
Executable file
@@ -0,0 +1,600 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Store Profile Generator Script
|
||||||
|
|
||||||
|
Analyzes PDF receipts from a store and generates a Python profile class
|
||||||
|
for the OCR extraction system.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python scripts/generate_store_profile.py \
|
||||||
|
--name "Magazin Exemplu" \
|
||||||
|
--cui "12345678" \
|
||||||
|
--receipts "docs/data-entry/MagazinExemplu*.pdf" \
|
||||||
|
--output "backend/modules/data_entry/services/ocr/profiles/magazin_exemplu.py"
|
||||||
|
|
||||||
|
Features:
|
||||||
|
- Submits PDFs to OCR API
|
||||||
|
- Analyzes extracted text for patterns (TVA, total, date, payment)
|
||||||
|
- Generates a BaseStoreProfile subclass with detected patterns
|
||||||
|
- Supports hot-reload via ProfileRegistry
|
||||||
|
|
||||||
|
Requirements:
|
||||||
|
- Backend server running on localhost:8000
|
||||||
|
- JWT authentication
|
||||||
|
- python-jose, requests packages
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import glob
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from collections import Counter, defaultdict
|
||||||
|
from datetime import datetime, timedelta, timezone
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
try:
|
||||||
|
import requests
|
||||||
|
from jose import jwt
|
||||||
|
except ImportError:
|
||||||
|
print("Error: Required packages not installed.")
|
||||||
|
print("Run: pip install python-jose requests")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
API_BASE = os.getenv("API_BASE", "http://localhost:8000")
|
||||||
|
JWT_SECRET = os.getenv("JWT_SECRET_KEY", "GENERATE_NEW_SECRET_FOR_PRODUCTION3334!")
|
||||||
|
|
||||||
|
|
||||||
|
def create_jwt_token() -> str:
|
||||||
|
"""Create a test JWT token for API authentication."""
|
||||||
|
payload = {
|
||||||
|
"username": "PROFILE_GENERATOR",
|
||||||
|
"user_id": 1,
|
||||||
|
"companies": ["604"],
|
||||||
|
"permissions": ["read", "write"],
|
||||||
|
"exp": datetime.now(timezone.utc) + timedelta(hours=1),
|
||||||
|
"iat": datetime.now(timezone.utc),
|
||||||
|
"type": "access"
|
||||||
|
}
|
||||||
|
return jwt.encode(payload, JWT_SECRET, algorithm="HS256")
|
||||||
|
|
||||||
|
|
||||||
|
def submit_ocr(pdf_path: str, token: str, api_base: str = API_BASE, timeout: int = 120) -> Optional[Dict]:
|
||||||
|
"""
|
||||||
|
Submit a PDF to OCR API and wait for result.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
pdf_path: Path to PDF file
|
||||||
|
token: JWT authentication token
|
||||||
|
api_base: API base URL
|
||||||
|
timeout: Max seconds to wait for completion
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Extraction result dict or None on failure
|
||||||
|
"""
|
||||||
|
headers = {"Authorization": f"Bearer {token}"}
|
||||||
|
filename = os.path.basename(pdf_path)
|
||||||
|
|
||||||
|
print(f" Submitting: {filename}...", end=" ", flush=True)
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(pdf_path, "rb") as f:
|
||||||
|
files = {"file": (filename, f, "application/pdf")}
|
||||||
|
response = requests.post(
|
||||||
|
f"{api_base}/api/data-entry/ocr/extract?engine=doctr_plus",
|
||||||
|
files=files,
|
||||||
|
headers=headers,
|
||||||
|
timeout=30
|
||||||
|
)
|
||||||
|
|
||||||
|
if response.status_code != 200:
|
||||||
|
print(f"FAILED (HTTP {response.status_code})")
|
||||||
|
return None
|
||||||
|
|
||||||
|
job_data = response.json()
|
||||||
|
job_id = job_data.get("job_id")
|
||||||
|
|
||||||
|
if not job_id:
|
||||||
|
print("FAILED (no job_id)")
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Poll for completion
|
||||||
|
start_time = time.time()
|
||||||
|
while time.time() - start_time < timeout:
|
||||||
|
poll_response = requests.get(
|
||||||
|
f"{api_base}/api/data-entry/ocr/jobs/{job_id}/wait?timeout=30",
|
||||||
|
headers=headers,
|
||||||
|
timeout=35
|
||||||
|
)
|
||||||
|
|
||||||
|
if poll_response.status_code == 200:
|
||||||
|
job_result = poll_response.json()
|
||||||
|
status = job_result.get("status")
|
||||||
|
|
||||||
|
if status == "completed":
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
print(f"OK ({elapsed:.1f}s)")
|
||||||
|
return job_result.get("result", {})
|
||||||
|
elif status == "error":
|
||||||
|
print(f"ERROR: {job_result.get('error', 'Unknown')}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
time.sleep(2)
|
||||||
|
|
||||||
|
print("TIMEOUT")
|
||||||
|
return None
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f"EXCEPTION: {e}")
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def analyze_tva_patterns(results: List[Dict]) -> Dict:
|
||||||
|
"""
|
||||||
|
Analyze TVA patterns from multiple extraction results.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Dict with detected patterns and statistics
|
||||||
|
"""
|
||||||
|
tva_entries = []
|
||||||
|
raw_texts = []
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
if r.get("tva_entries"):
|
||||||
|
tva_entries.extend(r["tva_entries"])
|
||||||
|
if r.get("raw_text"):
|
||||||
|
raw_texts.append(r["raw_text"])
|
||||||
|
|
||||||
|
# Analyze TVA code patterns (A, B, C, etc.)
|
||||||
|
codes = Counter(e.get("code") for e in tva_entries if e.get("code"))
|
||||||
|
|
||||||
|
# Analyze TVA percentage patterns
|
||||||
|
percents = Counter(e.get("percent") for e in tva_entries if e.get("percent") is not None)
|
||||||
|
|
||||||
|
# Detect TVA format from raw text
|
||||||
|
tva_formats = defaultdict(int)
|
||||||
|
for text in raw_texts:
|
||||||
|
text_upper = text.upper()
|
||||||
|
|
||||||
|
# Standard format: "TVA 19% 10.50" or "TVA: 19% 10.50"
|
||||||
|
if re.search(r'TVA\s*:?\s*\d{1,2}%', text_upper):
|
||||||
|
tva_formats["standard"] += 1
|
||||||
|
|
||||||
|
# Lidl format: "TVA A 21% 7.71"
|
||||||
|
if re.search(r'TVA\s+[A-D]\s+\d{1,2}', text_upper):
|
||||||
|
tva_formats["lidl_multi_rate"] += 1
|
||||||
|
|
||||||
|
# Table format: "BAZA TVA | % TVA | VALOARE TVA"
|
||||||
|
if re.search(r'BAZA\s+TVA', text_upper):
|
||||||
|
tva_formats["table"] += 1
|
||||||
|
|
||||||
|
# No TVA (neplatitor)
|
||||||
|
if re.search(r'NEPLATITOR|NON.?TVA', text_upper):
|
||||||
|
tva_formats["non_vat"] += 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"codes": dict(codes),
|
||||||
|
"percents": dict(percents),
|
||||||
|
"formats": dict(tva_formats),
|
||||||
|
"has_multi_rate": len(codes) > 1,
|
||||||
|
"is_non_vat": tva_formats.get("non_vat", 0) > 0,
|
||||||
|
"dominant_format": max(tva_formats, key=tva_formats.get) if tva_formats else "standard"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def analyze_total_patterns(results: List[Dict]) -> Dict:
|
||||||
|
"""Analyze TOTAL patterns from extraction results."""
|
||||||
|
totals = []
|
||||||
|
raw_texts = []
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
if r.get("amount"):
|
||||||
|
totals.append(float(r["amount"]))
|
||||||
|
if r.get("raw_text"):
|
||||||
|
raw_texts.append(r["raw_text"])
|
||||||
|
|
||||||
|
total_formats = defaultdict(int)
|
||||||
|
for text in raw_texts:
|
||||||
|
text_upper = text.upper()
|
||||||
|
|
||||||
|
if re.search(r'TOTAL\s*:?\s*[\d.,]+', text_upper):
|
||||||
|
total_formats["TOTAL:"] += 1
|
||||||
|
if re.search(r'TOTAL\s+DE\s+PLAT', text_upper):
|
||||||
|
total_formats["TOTAL DE PLATA"] += 1
|
||||||
|
if re.search(r'SUMA\s+TOTAL', text_upper):
|
||||||
|
total_formats["SUMA TOTALA"] += 1
|
||||||
|
if re.search(r'GRAND\s*TOTAL', text_upper):
|
||||||
|
total_formats["GRAND TOTAL"] += 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"count": len(totals),
|
||||||
|
"formats": dict(total_formats),
|
||||||
|
"dominant_format": max(total_formats, key=total_formats.get) if total_formats else "TOTAL"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def analyze_date_patterns(results: List[Dict]) -> Dict:
|
||||||
|
"""Analyze date patterns from extraction results."""
|
||||||
|
dates = []
|
||||||
|
raw_texts = []
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
if r.get("receipt_date"):
|
||||||
|
dates.append(r["receipt_date"])
|
||||||
|
if r.get("raw_text"):
|
||||||
|
raw_texts.append(r["raw_text"])
|
||||||
|
|
||||||
|
date_formats = defaultdict(int)
|
||||||
|
for text in raw_texts:
|
||||||
|
# DD.MM.YYYY
|
||||||
|
if re.search(r'\d{2}\.\d{2}\.\d{4}', text):
|
||||||
|
date_formats["DD.MM.YYYY"] += 1
|
||||||
|
# YYYY.MM.DD (OMV/SOCAR style)
|
||||||
|
if re.search(r'\d{4}\.\d{2}\.\d{2}', text):
|
||||||
|
date_formats["YYYY.MM.DD"] += 1
|
||||||
|
# DD-MM-YYYY
|
||||||
|
if re.search(r'\d{2}-\d{2}-\d{4}', text):
|
||||||
|
date_formats["DD-MM-YYYY"] += 1
|
||||||
|
# DD/MM/YYYY
|
||||||
|
if re.search(r'\d{2}/\d{2}/\d{4}', text):
|
||||||
|
date_formats["DD/MM/YYYY"] += 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"extracted_dates": dates,
|
||||||
|
"formats": dict(date_formats),
|
||||||
|
"dominant_format": max(date_formats, key=date_formats.get) if date_formats else "DD.MM.YYYY"
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
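The date analysis counts, per receipt text, which format regexes match and then takes the most frequent label. A compact sketch of that idea follows; `DATE_FORMATS` and `dominant_date_format` are hypothetical names, while the regexes are the ones used in `analyze_date_patterns`.

```python
import re
from collections import Counter
from typing import Iterable

DATE_FORMATS = {
    "DD.MM.YYYY": r'\d{2}\.\d{2}\.\d{4}',
    "YYYY.MM.DD": r'\d{4}\.\d{2}\.\d{2}',  # OMV/SOCAR style
    "DD-MM-YYYY": r'\d{2}-\d{2}-\d{4}',
    "DD/MM/YYYY": r'\d{2}/\d{2}/\d{4}',
}


def dominant_date_format(texts: Iterable[str], default: str = "DD.MM.YYYY") -> str:
    """Return the format label matched by the most texts, or the default."""
    counts = Counter()
    for text in texts:
        for label, pattern in DATE_FORMATS.items():
            if re.search(pattern, text):
                counts[label] += 1  # each text counts at most once per format
    return max(counts, key=counts.get) if counts else default


texts = ["DATA: 2025.12.31 ORA 10:15", "2025.11.02", "Data 05.01.2026"]
detected = dominant_date_format(texts)
```

An isolated `2025.12.31` does not also count as DD.MM.YYYY, because that regex needs four digits after the second dot. With only a handful of sample PDFs a tie between formats is possible, so the generated profile is best reviewed by hand before it is registered.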
|
def analyze_payment_patterns(results: List[Dict]) -> Dict:
|
||||||
|
"""Analyze payment method patterns."""
|
||||||
|
payment_counts = defaultdict(int)
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
methods = r.get("payment_methods", [])
|
||||||
|
for m in methods:
|
||||||
|
method_type = m.get("method", "UNKNOWN")
|
||||||
|
payment_counts[method_type] += 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"methods": dict(payment_counts),
|
||||||
|
"has_mixed_payments": len(payment_counts) > 1
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def analyze_client_patterns(results: List[Dict]) -> Dict:
|
||||||
|
"""Analyze client (B2B) patterns."""
|
||||||
|
has_client_cui = 0
|
||||||
|
has_client_name = 0
|
||||||
|
|
||||||
|
for r in results:
|
||||||
|
if r.get("client_cui"):
|
||||||
|
has_client_cui += 1
|
||||||
|
if r.get("client_name"):
|
||||||
|
has_client_name += 1
|
||||||
|
|
||||||
|
return {
|
||||||
|
"has_client_cui": has_client_cui > 0,
|
||||||
|
"has_client_name": has_client_name > 0,
|
||||||
|
"b2b_ratio": has_client_cui / len(results) if results else 0
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def generate_profile_code(
|
||||||
|
store_name: str,
|
||||||
|
cui: str,
|
||||||
|
tva_analysis: Dict,
|
||||||
|
total_analysis: Dict,
|
||||||
|
date_analysis: Dict,
|
||||||
|
payment_analysis: Dict,
|
||||||
|
client_analysis: Dict
|
||||||
|
) -> str:
|
||||||
|
"""
|
||||||
|
Generate Python profile class code.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
store_name: Human-readable store name
|
||||||
|
cui: CUI number (without RO prefix)
|
||||||
|
*_analysis: Analysis results from pattern detection
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Python source code for the profile class
|
||||||
|
"""
|
||||||
|
# Generate class name from store name
|
||||||
|
class_name = "".join(
|
||||||
|
word.capitalize()
|
||||||
|
for word in re.sub(r'[^a-zA-Z0-9\s]', '', store_name).split()
|
||||||
|
) + "Profile"
|
||||||
|
|
||||||
|
# Generate module name
|
||||||
|
module_name = re.sub(r'[^a-z0-9]', '_', store_name.lower()).strip('_')
|
||||||
|
|
||||||
|
# Determine profile characteristics
|
||||||
|
is_non_vat = tva_analysis.get("is_non_vat", False)
|
||||||
|
has_multi_rate = tva_analysis.get("has_multi_rate", False)
|
||||||
|
has_client_cui = client_analysis.get("has_client_cui", False)
|
||||||
|
uses_yyyy_mm_dd = date_analysis.get("dominant_format") == "YYYY.MM.DD"
|
||||||
|
|
||||||
|
# Generate OCR name patterns
|
||||||
|
name_words = store_name.upper().split()
|
||||||
|
primary_word = name_words[0] if name_words else store_name.upper()
|
||||||
|
name_patterns = [
|
||||||
|
primary_word,
|
||||||
|
store_name.upper().replace(".", "").replace(",", ""),
|
||||||
|
]
|
||||||
|
# Add OCR error variants
|
||||||
|
ocr_variants = {
|
||||||
|
'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3'
|
||||||
|
}
|
||||||
|
for char, replacement in ocr_variants.items():
|
||||||
|
if char in primary_word:
|
||||||
|
name_patterns.append(primary_word.replace(char, replacement, 1))
|
||||||
|
|
||||||
|
name_patterns = list(dict.fromkeys(name_patterns))[:4] # Unique, max 4
|
||||||
|
|
||||||
|
# Build the code
|
||||||
|
code_lines = [
|
||||||
|
'"""',
|
||||||
|
f'{store_name} store profile for OCR extraction.',
|
||||||
|
'',
|
||||||
|
'Auto-generated by generate_store_profile.py',
|
||||||
|
f'Generated: {datetime.now().strftime("%Y-%m-%d %H:%M")}',
|
||||||
|
'"""',
|
||||||
|
'',
|
||||||
|
'import re',
|
||||||
|
'from decimal import Decimal, InvalidOperation',
|
||||||
|
'from typing import List, Dict, Any',
|
||||||
|
'',
|
||||||
|
'from .base import BaseStoreProfile',
|
||||||
|
'from . import ProfileRegistry',
|
||||||
|
'',
|
||||||
|
'',
|
||||||
|
'@ProfileRegistry.register',
|
||||||
|
f'class {class_name}(BaseStoreProfile):',
|
||||||
|
' """',
|
||||||
|
f' {store_name} - OCR extraction profile.',
|
||||||
|
' ',
|
||||||
|
]
|
||||||
|
|
||||||
|
# Add characteristics to docstring
|
||||||
|
characteristics = []
|
||||||
|
if is_non_vat:
|
||||||
|
characteristics.append("Non-VAT payer (neplatitor TVA)")
|
||||||
|
if has_multi_rate:
|
||||||
|
characteristics.append("Multi-rate TVA")
|
||||||
|
if has_client_cui:
|
||||||
|
characteristics.append("B2B receipts with client CUI")
|
||||||
|
if uses_yyyy_mm_dd:
|
||||||
|
characteristics.append("Date format: YYYY.MM.DD")
|
||||||
|
|
||||||
|
if characteristics:
|
||||||
|
code_lines.append(' Key characteristics:')
|
||||||
|
for c in characteristics:
|
||||||
|
code_lines.append(f' - {c}')
|
||||||
|
code_lines.append(' ')
|
||||||
|
|
||||||
|
code_lines.extend([
|
||||||
|
' """',
|
||||||
|
'',
|
||||||
|
f' CUI_LIST = ["{cui}"]',
|
||||||
|
f' NAME_PATTERNS = {name_patterns}',
|
||||||
|
f'    STORE_NAME = {store_name!r}',
|
||||||
|
'',
|
||||||
|
])
|
||||||
|
|
||||||
|
# Add date patterns override for YYYY.MM.DD format
|
||||||
|
if uses_yyyy_mm_dd:
|
||||||
|
code_lines.extend([
|
||||||
|
' # Override date patterns for YYYY.MM.DD format',
|
||||||
|
' DATE_PATTERNS_OCR_SPACES = [',
|
||||||
|
' r\'(\\d{4})[.,]\\s*(\\d{2})[.,]\\s*(\\d{2})\', # YYYY. MM. DD with spaces',
|
||||||
|
' r\'(\\d{4})[.,](\\d{2})[.,](\\d{2})\', # YYYY.MM.DD',
|
||||||
|
' ]',
|
||||||
|
'',
|
||||||
|
])
|
||||||
|
|
||||||
|
# Add TVA extraction method for multi-rate or non-VAT
|
||||||
|
if is_non_vat:
|
||||||
|
code_lines.extend([
|
||||||
|
' def extract_tva_entries(self, text: str) -> List[dict]:',
|
||||||
|
' """Non-VAT payer - returns empty list."""',
|
||||||
|
' return []',
|
||||||
|
'',
|
||||||
|
])
|
||||||
|
elif has_multi_rate and tva_analysis.get("dominant_format") == "lidl_multi_rate":
|
||||||
|
code_lines.extend([
|
||||||
|
' # Store-specific TVA patterns',
|
||||||
|
' TVA_PATTERNS = [',
|
||||||
|
' r\'T[VU][AR]\\s+([A-D])\\s+(\\d{1,2})[.,]?\\d{0,2}\\s*%\\s+([\\d.,]+)\',',
|
||||||
|
' ]',
|
||||||
|
'',
|
||||||
|
' def extract_tva_entries(self, text: str) -> List[dict]:',
|
||||||
|
' """Extract multi-rate TVA entries."""',
|
||||||
|
' entries = []',
|
||||||
|
' seen = set()',
|
||||||
|
'',
|
||||||
|
' for pattern in self.TVA_PATTERNS:',
|
||||||
|
' for match in re.finditer(pattern, text, re.IGNORECASE):',
|
||||||
|
' try:',
|
||||||
|
' code = match.group(1).upper()',
|
||||||
|
' percent = int(match.group(2))',
|
||||||
|
' amount = self._parse_decimal(match.group(3))',
|
||||||
|
'',
|
||||||
|
' if amount and amount > 0:',
|
||||||
|
' entry_key = (code, percent)',
|
||||||
|
' if entry_key not in seen:',
|
||||||
|
' entries.append({',
|
||||||
|
' \'code\': code,',
|
||||||
|
' \'percent\': percent,',
|
||||||
|
' \'amount\': amount',
|
||||||
|
' })',
|
||||||
|
' seen.add(entry_key)',
|
||||||
|
' except (ValueError, InvalidOperation):',
|
||||||
|
' continue',
|
||||||
|
'',
|
||||||
|
' return entries',
|
||||||
|
'',
|
||||||
|
])
|
||||||
|
|
||||||
|
# Add validation hints method
|
||||||
|
code_lines.extend([
|
||||||
|
' def get_validation_hints(self) -> Dict[str, Any]:',
|
||||||
|
f' """Return {store_name}-specific validation hints."""',
|
||||||
|
' return {',
|
||||||
|
f' "has_multi_rate_tva": {has_multi_rate},',
|
||||||
|
'            "card_equals_total": True,',
|
||||||
|
f' "has_client_cui": {has_client_cui},',
|
||||||
|
'            "has_efactura": False,',
|
||||||
|
f' "is_non_vat_payer": {is_non_vat},',
|
||||||
|
' }',
|
||||||
|
])
|
||||||
|
|
||||||
|
return '\n'.join(code_lines) + '\n'
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Generate store profile from PDF receipts",
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
epilog="""
|
||||||
|
Examples:
|
||||||
|
# Generate profile from a single PDF
|
||||||
|
python scripts/generate_store_profile.py \\
|
||||||
|
--name "Magazin Nou" --cui "12345678" \\
|
||||||
|
--receipts "docs/data-entry/magazin_nou.pdf"
|
||||||
|
|
||||||
|
# Generate profile from multiple PDFs (glob pattern)
|
||||||
|
python scripts/generate_store_profile.py \\
|
||||||
|
--name "Carrefour" --cui "2475489" \\
|
||||||
|
--receipts "docs/data-entry/Carrefour*.pdf" \\
|
||||||
|
--output backend/modules/data_entry/services/ocr/profiles/carrefour.py
|
||||||
|
|
||||||
|
# Dry run (analyze only, don't write file)
|
||||||
|
python scripts/generate_store_profile.py \\
|
||||||
|
--name "Test Store" --cui "11111111" \\
|
||||||
|
--receipts "docs/data-entry/test*.pdf" \\
|
||||||
|
--dry-run
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument("--name", required=True, help="Store name (e.g., 'LIDL DISCOUNT S.R.L.')")
|
||||||
|
parser.add_argument("--cui", required=True, help="CUI number without RO prefix")
|
||||||
|
parser.add_argument("--receipts", required=True, help="PDF file path or glob pattern")
|
||||||
|
parser.add_argument("--output", help="Output file path (default: auto-generated)")
|
||||||
|
parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't write file")
|
||||||
|
parser.add_argument("--api-base", default=API_BASE, help=f"API base URL (default: {API_BASE})")
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
# Update API base if provided
|
||||||
|
api_base = args.api_base
|
||||||
|
|
||||||
|
# Validate CUI format
|
||||||
|
cui = args.cui.strip().upper().replace("RO", "").replace(" ", "")
|
||||||
|
if not cui.isdigit() or len(cui) < 6 or len(cui) > 10:
|
||||||
|
print(f"Error: Invalid CUI format: {args.cui}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
# Find PDF files
|
||||||
|
pdf_files = sorted(glob.glob(args.receipts))  # sorted for a deterministic processing order
|
||||||
|
if not pdf_files:
|
||||||
|
print(f"Error: No PDF files found matching: {args.receipts}")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("Store Profile Generator")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
print(f"Store: {args.name}")
|
||||||
|
print(f"CUI: {cui}")
|
||||||
|
print(f"PDFs: {len(pdf_files)} files")
|
||||||
|
print(f"{'='*60}\n")
|
||||||
|
|
||||||
|
# Generate JWT token
|
||||||
|
token = create_jwt_token()
|
||||||
|
|
||||||
|
# Submit PDFs to OCR
|
||||||
|
print("Step 1: Submitting PDFs to OCR API...")
|
||||||
|
results = []
|
||||||
|
for pdf_path in pdf_files:
|
||||||
|
result = submit_ocr(pdf_path, token, api_base=api_base)
|
||||||
|
if result:
|
||||||
|
results.append(result)
|
||||||
|
|
||||||
|
if not results:
|
||||||
|
print("\nError: No successful extractions. Check if backend is running.")
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
print(f"\nSuccessfully extracted: {len(results)}/{len(pdf_files)} PDFs")
|
||||||
|
|
||||||
|
# Analyze patterns
|
||||||
|
print("\nStep 2: Analyzing patterns...")
|
||||||
|
tva_analysis = analyze_tva_patterns(results)
|
||||||
|
total_analysis = analyze_total_patterns(results)
|
||||||
|
date_analysis = analyze_date_patterns(results)
|
||||||
|
payment_analysis = analyze_payment_patterns(results)
|
||||||
|
client_analysis = analyze_client_patterns(results)
|
||||||
|
|
||||||
|
print(f" TVA: {tva_analysis['dominant_format']} format, multi-rate={tva_analysis['has_multi_rate']}")
|
||||||
|
print(f" Date: {date_analysis['dominant_format']} format")
|
||||||
|
print(f" Payments: {list(payment_analysis['methods'].keys())}")
|
||||||
|
print(f" B2B: {client_analysis['has_client_cui']}")
|
||||||
|
|
||||||
|
# Generate profile code
|
||||||
|
print("\nStep 3: Generating profile code...")
|
||||||
|
code = generate_profile_code(
|
||||||
|
store_name=args.name,
|
||||||
|
cui=cui,
|
||||||
|
tva_analysis=tva_analysis,
|
||||||
|
total_analysis=total_analysis,
|
||||||
|
date_analysis=date_analysis,
|
||||||
|
payment_analysis=payment_analysis,
|
||||||
|
client_analysis=client_analysis
|
||||||
|
)
|
||||||
|
|
||||||
|
# Determine output path
|
||||||
|
if args.output:
|
||||||
|
output_path = args.output
|
||||||
|
else:
|
||||||
|
module_name = re.sub(r'[^a-z0-9]+', '_', args.name.lower()).strip('_')
|
||||||
|
output_path = f"backend/modules/data_entry/services/ocr/profiles/{module_name}.py"
|
||||||
|
|
||||||
|
if args.dry_run:
|
||||||
|
print(f"\n[DRY RUN] Would write to: {output_path}")
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("Generated code:")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
print(code)
|
||||||
|
else:
|
||||||
|
# Write file
|
||||||
|
os.makedirs(os.path.dirname(output_path) or '.', exist_ok=True)  # dirname is '' for a bare filename
|
||||||
|
with open(output_path, 'w', encoding='utf-8') as f:
|
||||||
|
f.write(code)
|
||||||
|
print(f" Written to: {output_path}")
|
||||||
|
|
||||||
|
# Verify syntax
|
||||||
|
import py_compile
|
||||||
|
try:
|
||||||
|
py_compile.compile(output_path, doraise=True)
|
||||||
|
print("  Syntax check: OK")
|
||||||
|
except py_compile.PyCompileError as e:
|
||||||
|
print(f" Syntax check: FAILED - {e}")
|
||||||
|
|
||||||
|
print(f"\n{'='*60}")
|
||||||
|
print("Profile generation complete!")
|
||||||
|
print(f"{'='*60}")
|
||||||
|
|
||||||
|
if not args.dry_run:
|
||||||
|
print("\nNext steps:")
|
||||||
|
print(f"1. Review the generated code: {output_path}")
|
||||||
|
print("2. Customize patterns if needed")
|
||||||
|
print("3. Hot-reload profiles: curl -X POST http://localhost:8000/api/data-entry/ocr/profiles/reload")
|
||||||
|
print("4. Test with a sample receipt")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
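The name-pattern step inside `generate_profile_code` can be tried on its own. The sketch below repeats the same logic as a standalone function. `build_name_patterns`, `OCR_VARIANTS` and `max_patterns` are hypothetical names introduced here for illustration; they are not part of the script.

```python
from typing import List

# Common OCR confusions, same mapping as in generate_profile_code.
OCR_VARIANTS = {'O': '0', 'I': '1', 'L': '1', 'S': '5', 'B': '8', 'E': '3'}


def build_name_patterns(store_name: str, max_patterns: int = 4) -> List[str]:
    """Hypothetical helper mirroring the script's name-pattern logic."""
    upper = store_name.upper()
    words = upper.split()
    primary = words[0] if words else upper

    # Start with the first word and the full name without punctuation.
    patterns = [primary, upper.replace(".", "").replace(",", "")]
    for char, replacement in OCR_VARIANTS.items():
        if char in primary:
            # Only the first occurrence is swapped, as in the script.
            patterns.append(primary.replace(char, replacement, 1))

    # dict.fromkeys removes duplicates while keeping insertion order.
    return list(dict.fromkeys(patterns))[:max_patterns]


if __name__ == "__main__":
    print(build_name_patterns("LIDL DISCOUNT S.R.L."))
    print(build_name_patterns("OMV"))
```

Because the list is capped, a first word with many confusable letters keeps only the earliest variants in mapping order, so the generated `NAME_PATTERNS` may need manual additions during review.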
|
||||||
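To check the multi-rate TVA pattern before hot-reloading a generated profile, the extraction loop can be run outside the profile class. This is a minimal sketch under stated assumptions: `parse_decimal` is a hypothetical stand-in for `BaseStoreProfile._parse_decimal` (whose real behavior is not shown here), and the sample receipt text is invented for illustration.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Any, Dict, List

# Same multi-rate pattern the generator emits: code letter, rate, amount.
TVA_PATTERN = r'T[VU][AR]\s+([A-D])\s+(\d{1,2})[.,]?\d{0,2}\s*%\s+([\d.,]+)'


def parse_decimal(raw: str) -> Decimal:
    """Hypothetical stand-in for BaseStoreProfile._parse_decimal."""
    cleaned = raw.strip().rstrip('.,')
    if ',' in cleaned and '.' in cleaned:
        # Assumption: dots are thousands separators when both appear.
        cleaned = cleaned.replace('.', '')
    return Decimal(cleaned.replace(',', '.'))


def extract_tva_entries(text: str) -> List[Dict[str, Any]]:
    """Collect one entry per (code, percent) pair, skipping unparsable rows."""
    entries = []
    seen = set()
    for match in re.finditer(TVA_PATTERN, text, re.IGNORECASE):
        try:
            code = match.group(1).upper()
            percent = int(match.group(2))
            amount = parse_decimal(match.group(3))
        except (ValueError, InvalidOperation):
            continue
        if amount > 0 and (code, percent) not in seen:
            entries.append({'code': code, 'percent': percent, 'amount': amount})
            seen.add((code, percent))
    return entries


if __name__ == "__main__":
    sample = "TVA A 19% 12,34\nTVA B 9,00% 1,50\nTVA A 19% 12,34"
    print(extract_tva_entries(sample))
```

Deduplicating on `(code, percent)` means a repeated summary line on the receipt does not produce a second entry, which matches the behavior of the generated `extract_tva_entries` method.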