feat(ocr): Add validation system and CLIENT CUI extraction
OCR Data Extraction Validation System:
- Add 7 validation rules (amount range, TVA ratio, payment sum, etc.)
- Add Medium preprocessing to replace Heavy (fixes digit concatenation)
- Add validation warnings to API responses
- Flag receipts needing manual review (needs_manual_review field)
- Add database migration for needs_manual_review column

CLIENT CUI Extraction Improvements:
- Support all format variations: CIF CLIENT:, CLIENT C.U.I/C.I.F., etc.
- Handle OCR errors (R0 vs RO, C1F vs CIF)
- Add client_name, client_cui, client_address to API response
- Add validation fields to API response (was missing)

QA Review: 12 issues found, 9 fixed (5 errors + 4 warnings)
- Fixed type safety in validation rules
- Fixed ZeroDivisionError risk
- Fixed schema mismatch (Optional[bool] for needs_manual_review)
- All 37 unit tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
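The message mentions a TVA ratio rule and a ZeroDivisionError fix, but the validation code is not part of the hunk shown here. As a rough sketch of how such a rule might guard its division: the function name `tva_ratio_warning`, the use of `Decimal`, and the 25% upper bound are all assumptions made for illustration, not taken from this commit.

```python
from decimal import Decimal
from typing import Optional


def tva_ratio_warning(
    total: Optional[Decimal],
    tva: Optional[Decimal],
    max_ratio: Decimal = Decimal("0.25"),  # illustrative bound, not from the commit
) -> Optional[str]:
    """Return a warning if TVA looks implausible relative to the total, else None."""
    if total is None or tva is None:
        # Nothing to compare; missing fields are assumed to be reported elsewhere.
        return None
    if total <= 0:
        # Guard the division below so a zero total cannot raise ZeroDivisionError.
        return "total is zero or negative; cannot compute TVA ratio"
    ratio = tva / total
    if ratio < 0 or ratio > max_ratio:
        return f"TVA ratio {ratio:.2f} outside expected range 0..{max_ratio}"
    return None
```

Under this sketch, any non-None result would be added to the validation warnings, and a non-empty warning list could set needs_manual_review.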
@@ -104,10 +104,80 @@ class ImagePreprocessor:
        # NO binarization, NO morphological ops - preserve original quality
        return enhanced

    def preprocess_medium(self, image: np.ndarray) -> np.ndarray:
        """
        Medium preprocessing for MIXED-QUALITY images.
        Balance between Light (too gentle) and Heavy (too aggressive).

        Use cases:
        - Moderately faded receipts
        - Photos with uneven lighting
        - Scans with slight blur

        Preprocessing steps:
        - Moderate contrast enhancement (CLAHE clipLimit=2.0)
        - Light denoising (fastNlMeansDenoising h=6)
        - Gentle sharpening
        - NO binarization (preserves text boundaries)
        - NO morphological operations (avoids digit concatenation)

        This method was created to replace preprocess_heavy() which caused
        digit concatenation errors on high-quality PDFs (85.99 → 859,762.16).
        """
        # 0. Add safety padding to protect edge content during deskew rotation
        image = self._add_safety_padding(image)

        # 1. Grayscale
        if len(image.shape) == 3:
            gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        else:
            gray = image.copy()

        # 2a. Scale DOWN if any side exceeds 4000px (PaddleOCR limit)
        height, width = gray.shape
        max_side = max(height, width)
        if max_side > 4000:
            scale = 4000 / max_side
            gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)
            height, width = gray.shape

        # 2b. Scale UP if too small
        if width < 1500:
            scale = 1500 / width
            # Ensure we don't exceed 4000px after upscaling
            new_width = int(width * scale)
            new_height = int(height * scale)
            if max(new_width, new_height) > 4000:
                # Cap relative to the current size, not the already-upscaled size
                scale = 4000 / max(width, height)
            gray = cv2.resize(gray, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

        # 3. Deskew
        gray = self._deskew(gray)

        # 4. Moderate contrast enhancement (CLAHE clipLimit=2.0)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(gray)

        # 5. Light denoising (less aggressive than Heavy)
        denoised = cv2.fastNlMeansDenoising(enhanced, h=6, templateWindowSize=7, searchWindowSize=15)

        # 6. Gentle sharpening (unsharp mask: boost the image, subtract its blur)
        gaussian = cv2.GaussianBlur(denoised, (0, 0), 1.0)
        sharpened = cv2.addWeighted(denoised, 1.3, gaussian, -0.3, 0)

        # NO binarization, NO morphological operations
        # This preserves text boundaries and avoids digit concatenation
        return sharpened

    def preprocess_heavy(self, image: np.ndarray) -> np.ndarray:
        """
        Heavy preprocessing for FADED thermal receipts.
        Aggressive binarization to recover faded text.

        ⚠️ DEPRECATED: Use preprocess_medium() instead.
        Heavy preprocessing causes digit concatenation on clear PDFs
        (e.g., 85.99 → 859,762.16 due to binarization + morphological operations).
        Kept for backward compatibility only.
        """
        # 0. Add safety padding to protect edge content during deskew rotation
        image = self._add_safety_padding(image)

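Steps 2a and 2b of preprocess_medium() bound the image size from both directions: no side above 4000px, and width raised toward 1500px where that is possible. The sketch below shows one way the two bounds could be folded into a single scale factor. The helper name `medium_resize_scale` and its parameters are hypothetical and do not appear in the commit; it also ignores the integer rounding that cv2.resize applies.

```python
def medium_resize_scale(
    width: int,
    height: int,
    min_width: int = 1500,
    max_side: int = 4000,
) -> float:
    """Return one scale factor combining the downscale (2a) and upscale (2b) steps."""
    longest = max(width, height)
    if longest > max_side:
        # 2a: shrink so the longest side fits within max_side.
        # Any later upscale would be capped at 1.0, so this is the final factor.
        return max_side / longest
    if width < min_width:
        # 2b: enlarge toward min_width, but never push the longest side past max_side.
        return min(min_width / width, max_side / longest)
    return 1.0


if __name__ == "__main__":
    # A narrow, tall image: the upscale is limited by the height cap.
    print(medium_resize_scale(1000, 3500))
```

Computing the cap from the current dimensions (rather than from the already-upscaled ones) keeps the factor at or above 1.0 in the upscale branch, so a small image is never accidentally shrunk.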