fix(import): NFKD normalization for non-Romanian diacritics

clean_web_text used a hard-coded, Romanian-only translation map, so Hungarian (BALÁZS LORÁNT), German, Czech, and Polish names passed through unchanged into SQLite and Oracle ROA.

Replace the map with unicodedata.normalize('NFKD') followed by a combining-mark strip, which covers RO/HU/DE/CZ/PL/FR/ES without per-language tables.

Romanian legacy cedilla forms (ş/ţ/Ş/Ţ) remain handled: NFKD decomposes them to base letter + combining cedilla, and the combining mark is then stripped.

Letters that NFKD does not decompose (ß, ł, đ, ø, æ, œ) are covered by the _NFKD_OVERRIDES translation map.

sync_service._addr_match.norm is migrated off the removed _DIACRITICS constant to clean_web_text, so address matching now also handles non-Romanian diacritics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
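
For readers without the import_service diff at hand, the approach described above can be sketched roughly as follows. This is an illustration only, not the committed code: the function name fold_diacritics is hypothetical, the replacement targets in the override table are assumed (the message names the letters but not what they map to), and applying the overrides before normalization is also an assumption.

```python
import unicodedata

# Assumed targets for letters that have no NFKD decomposition.
# The commit names the letters; these replacements are illustrative guesses.
_NFKD_OVERRIDES = str.maketrans({
    "ß": "ss",
    "ł": "l", "Ł": "L",
    "đ": "d", "Đ": "D",
    "ø": "o", "Ø": "O",
    "æ": "ae", "Æ": "AE",
    "œ": "oe", "Œ": "OE",
})


def fold_diacritics(text: str) -> str:
    """Hypothetical stand-in for the diacritic-folding step of clean_web_text."""
    # 1. Replace letters that NFKD leaves untouched.
    text = text.translate(_NFKD_OVERRIDES)
    # 2. Decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # 3. Drop the combining marks (non-zero canonical combining class).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("BALÁZS LORÁNT"))  # BALAZS LORANT
```

Both Romanian spellings fold to the same ASCII letter in this sketch: the legacy cedilla form ş and the comma-below form ș each decompose to s plus a combining mark, which step 3 removes.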
2026-05-05 22:52:50 +00:00
parent 9b62b2b457
commit 956667086d
3 changed files with 64 additions and 20 deletions
--- a/api/app/services/sync_service.py
+++ b/api/app/services/sync_service.py
@@ -48,7 +48,7 @@ def _addr_match(gomag_json, roa_json):
        r'ET|ETAJ|COM|COMUNA|SAT|MUN|MUNICIPIUL|JUD|JUDETUL|CARTIER|PARTER|SECTOR|SECTORUL|ORAS)(?:\b|(?=\d))'
    )
    def norm(s):
-        s = (s or '').translate(import_service._DIACRITICS).upper()
+        s = import_service.clean_web_text(s or '').upper()
        s = _ADDR_WORDS.sub('', s)
        return re.sub(r'[^A-Z0-9]', '', s)
    def _soundex(s):