feat(5.18): k-NN corpus of labeled examples + real Haiku seed (17181 ops)

Seed app/data/operatii-etichetate.json regenerated with Haiku subagents over ALL 17181 distinct operations (frequency order, 100%), replacing the Groq seed (3758).

Haiku vs Groq validation on 157 labeled ops: on disagreements Haiku was correct ~22/30, Groq ~0. Haiku catches the junk Groq missed (ITP, tire rental, part names with no action): NUL 2200 (12.8%) vs ~7.6% for Groq; electronic adaptation is OE-7 (not OE-5), worn brake pads are OE-1 (not OE-F, accident damage).

US-001..006: deterministic NUL prefilter, offline labeler, seed generator, mapping_suggestions seeder (in init_db, gated by seed_operatii_enabled), embeddings index the labeled corpus, enrich NUL+kNN.

Seed distribution: OE-1 80.1%, NUL 12.8%, OE-2 3.5%, the rest rare (OE-4/3/7/8/R/I/5, AITLV, R-ODO).

config: seed_operatii_enabled=True + embeddings_enabled=True by default (SILVER populated + semantic suggestions; both suggestion-only, can be disabled via env).

Test suite: 1387 passed, 1 deselected (live).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 06:59:15 +00:00
parent c05fa00007
commit 756f77730f
17 changed files with 139308 additions and 44 deletions
--- a/tools/mapare-llm/eticheteaza.py
+++ b/tools/mapare-llm/eticheteaza.py
@@ -0,0 +1,258 @@
+"""Offline labeler: service operations -> RAR codes (US-002, PRD 5.18).
+
+Default backend = **local LM Studio** (Qwen3-4B, RX 6600M GPU over Tailscale),
+the APPROVED backend for the v1 bootstrap (decision D4). Groq / OpenRouter remain
+interchangeable fallbacks, but they are NOT the approved path for v1.
+
+Quirks that justify a NEW tool (rather than reusing `or_common.call`):
+  - LM Studio REJECTS `response_format: json_object` (HTTP 400). It requires the full
+    STRICT `json_schema` envelope: {"type":"json_schema","json_schema":{...,"strict":true}}.
+  - `cod` is an ENUM over the 19 labels (18 RAR codes + NUL) -> the model cannot
+    invent codes; any deviation is caught by the truncation guard ('?').
+  - Qwen3 emits `<think>...` unless thinking is disabled -> inflates tokens/latency
+    under strict structured output. We put `/no_think` in the system prompt.
+
+MANDATORY conservative settings on the GPU box (it shut down under load on 2026-06-29,
+probably thermal/power): in LM Studio load the model with `n_parallel=1`,
+`n_ctx=4096`, batch 32-40, and monitor the temperature. Do NOT raise batch/context
+without thermal headroom. See memory note `lmstudio-gpu-etichetare`.
+
+Reuses from `or_common`: the PII scrub (F3) and the list of codes.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+import time
+import urllib.error
+import urllib.request
+from dataclasses import dataclass
+
+# --- Codes + PII scrub: source of truth = or_common (same label nomenclature) ---
+import importlib.util as _ilu
+
+_OR_PATH = os.path.join(os.path.dirname(__file__), "or_common.py")
+_spec = _ilu.spec_from_file_location("or_common", _OR_PATH)
+or_common = _ilu.module_from_spec(_spec)
+sys.modules.setdefault("or_common", or_common)
+_spec.loader.exec_module(or_common)
+
+scrub = or_common.scrub  # VIN/license plate -> [VIN]/[NR]
+
+# The 19 labels (18 RAR codes + NUL), extracted from CODURI (single source: or_common).
+ALL_LABELS: list[str] = [c.split("=")[0].strip() for c in or_common.CODURI.replace(", ", ",").split(",")]
+assert "NUL" in ALL_LABELS and len(ALL_LABELS) == 19, ALL_LABELS
+_VALID = set(ALL_LABELS)
+
+
+# --------------------------------------------------------------------------- #
+# 3-step procedural prompt (versioned)                                        #
+# --------------------------------------------------------------------------- #
+
+PROMPT_VERSION = "3pasi-v1"
+
+_CODURI_LISTA = or_common.CODURI
+
+SYS = (
+    "Esti expert RAR AUTOPASS. Clasifici fiecare operatie de service-auto in EXACT unul "
+    "din aceste coduri:\n" + _CODURI_LISTA + "\n\n"
+    "Urmeaza PROCEDURA in 3 pasi, in ordine:\n"
+    "PAS 1 (non-operatie -> NUL): daca textul NU e o operatie tehnica de service "
+    "(ITP, plata/achitat, discount/reducere, taxa, nr inmatriculare/placuta, manopera "
+    "generica, sau DOAR un nume de piesa fara actiune) -> cod = NUL. Opreste-te.\n"
+    "PAS 2 (avarie din ACCIDENT -> avarie grava): foloseste codurile de avarie grava DOAR "
+    "pentru daune in urma unui accident, pe sistemul avariat:\n"
+    "  caroserie/structura rezistenta -> OE-C; sasiu -> OE-S; directie -> OE-D; "
+    "franare -> OE-F; sistem de retinere/airbag -> OE-R; ADAS (asistenta condus) -> OE-A.\n"
+    "  Reparatiile curente, de uzura (NU dintr-un accident) NU sunt avarii grave -> mergi la PAS 3.\n"
+    "PAS 3 (operatie obisnuita): \n"
+    "  inlocuire / D-R / reparare / vopsire / retus piese -> OE-1 (REPARATIE);\n"
+    "  schimb ulei motor + filtre -> OE-3 (REVIZIE PERIODICA);\n"
+    "  aerisit / gresat / completat nivele -> OE-2 (INTRETINERE);\n"
+    "  reglare functionala (geometrie directie, faruri, ralanti) -> OE-4;\n"
+    "  actualizare/programare software -> OE-7; schimb sezonier anvelope -> OE-8;\n"
+    "  istoric/reparatie/inlocuire odometru -> OE-I / R-ODO / I-ODO; tahograf -> AITLV.\n\n"
+    "Raspunde DOAR cu JSON conform schemei. /no_think"
+)
+
+
+def construieste_mesaje(batch: list[str]) -> list[dict]:
+    """Chat messages (procedural system prompt + numbered user list). PII-scrubs every item."""
+    user = "\n".join(f"{i + 1}. {scrub(o)}" for i, o in enumerate(batch))
+    return [
+        {"role": "system", "content": SYS},
+        {"role": "user", "content": user},
+    ]
+
+
+# --------------------------------------------------------------------------- #
+# Strict json_schema (full envelope; LM Studio rejects json_object)           #
+# --------------------------------------------------------------------------- #
+
+def _response_format() -> dict:
+    return {
+        "type": "json_schema",
+        "json_schema": {
+            "name": "etichete_operatii",
+            "strict": True,
+            "schema": {
+                "type": "object",
+                "properties": {
+                    "rez": {
+                        "type": "array",
+                        "items": {
+                            "type": "object",
+                            "properties": {
+                                "i": {"type": "integer"},
+                                "cod": {"type": "string", "enum": ALL_LABELS},
+                            },
+                            "required": ["i", "cod"],
+                            "additionalProperties": False,
+                        },
+                    }
+                },
+                "required": ["rez"],
+                "additionalProperties": False,
+            },
+        },
+    }
+
+
+# --------------------------------------------------------------------------- #
+# Backends (LM Studio default; Groq/OpenRouter fallback)                      #
+# --------------------------------------------------------------------------- #
+
+@dataclass
+class Backend:
+    name: str
+    url: str
+    model: str
+    api_key: str | None = None
+
+
+# Default LM Studio endpoint = GPU box on Tailscale (memory note lmstudio-gpu-etichetare).
+_DEFAULT_LMSTUDIO_URL = "http://100.64.151.22:1234/v1/chat/completions"
+
+_BACKENDS = {
+    "lmstudio": {"url": _DEFAULT_LMSTUDIO_URL, "model": "qwen/qwen3-4b", "key_env": None},
+    "groq": {"url": "https://api.groq.com/openai/v1/chat/completions",
+             "model": "llama-3.3-70b-versatile", "key_env": "GROQ_KEY"},
+    "openrouter": {"url": "https://openrouter.ai/api/v1/chat/completions",
+                   "model": "qwen/qwen3-4b:free", "key_env": "OPENROUTER_KEY"},
+}
+
+
+def get_backend(name: str | None = None) -> Backend:
+    """Build the backend from env. Default = lmstudio (D4).
+
+    Overrides: ETICHETARE_BACKEND, ETICHETARE_ENDPOINT, ETICHETARE_MODEL.
+    The API key (Groq/OpenRouter) is read from the env var named by the backend; local
+    LM Studio needs no key.
+    """
+    name = (name or os.environ.get("ETICHETARE_BACKEND") or "lmstudio").strip().lower()
+    if name not in _BACKENDS:
+        raise ValueError(f"backend necunoscut: {name} (alege din {list(_BACKENDS)})")
+    cfg = _BACKENDS[name]
+    url = os.environ.get("ETICHETARE_ENDPOINT") or cfg["url"]
+    model = os.environ.get("ETICHETARE_MODEL") or cfg["model"]
+    api_key = os.environ.get(cfg["key_env"]) if cfg["key_env"] else None
+    return Backend(name=name, url=url, model=model, api_key=api_key)
+
+
+def construieste_body(batch: list[str], backend: Backend) -> dict:
+    """OpenAI-compatible request body with the strict json_schema envelope."""
+    return {
+        "model": backend.model,
+        "messages": construieste_mesaje(batch),
+        "temperature": 0,
+        "response_format": _response_format(),
+    }
+
+
+# --------------------------------------------------------------------------- #
+# Parsing + truncation guard                                                  #
+# --------------------------------------------------------------------------- #
+
+def parseaza_raspuns(content: dict, n: int) -> list[str]:
+    """Map the response {"rez":[{i,cod}]} to a list parallel to the batch (len n).
+
+    Truncation/validation guard (F8): missing positions OR codes outside the enum
+    become '?'; they are NOT silently hidden. The caller logs how many '?' remain.
+    """
+    by_i: dict[int, str] = {}
+    for x in content.get("rez") or []:
+        try:
+            idx = int(x["i"])
+        except (KeyError, TypeError, ValueError):
+            continue
+        cod = str(x.get("cod") or "").strip().upper()
+        by_i[idx] = cod if cod in _VALID else "?"
+    return [by_i.get(i + 1, "?") for i in range(n)]
+
+
+# --------------------------------------------------------------------------- #
+# Transport (injectable in tests)                                             #
+# --------------------------------------------------------------------------- #
+
+def _urllib_transport(url: str, headers: dict, payload: dict, timeout: int) -> dict:
+    data = json.dumps(payload).encode()
+    req = urllib.request.Request(url, data=data, headers=headers)
+    with urllib.request.urlopen(req, timeout=timeout) as r:
+        return json.load(r)
+
+
+def call(
+    batch: list[str],
+    backend: Backend,
+    *,
+    timeout: int = 180,
+    max_attempts: int = 5,
+    transport=None,
+) -> tuple[list[str], dict]:
+    """One call for one batch. Returns (codes, meta).
+
+    codes: list parallel to batch; '?' at positions without a valid answer (guard F8).
+    meta: {ms, err, missing}; `missing` = how many '?' remain (truncation/invalid code).
+    transport: callable(url, headers, payload, timeout) -> OpenAI response dict
+               (injectable in tests; default urllib).
+    """
+    transport = transport or _urllib_transport
+    body = construieste_body(batch, backend)
+    headers = {"Content-Type": "application/json", "User-Agent": "Mozilla/5.0"}
+    if backend.api_key:
+        headers["Authorization"] = f"Bearer {backend.api_key}"
+    t0 = time.time()
+    for attempt in range(max_attempts):
+        try:
+            resp = transport(backend.url, headers, body, timeout)
+            content = json.loads(resp["choices"][0]["message"]["content"])
+            codes = parseaza_raspuns(content, len(batch))
+            missing = codes.count("?")
+            return codes, {"ms": int((time.time() - t0) * 1000), "err": None, "missing": missing}
+        except urllib.error.HTTPError as e:
+            if e.code in (429, 500, 502, 503):
+                ra = str(e.headers.get("retry-after") or "")  # may be an HTTP-date, not just seconds
+                time.sleep(float(ra) if ra.isdigit() and int(ra) else min(2 ** attempt, 30))
+                continue
+            return ["?"] * len(batch), {"ms": int((time.time() - t0) * 1000), "err": f"HTTP {e.code}", "missing": len(batch)}
+        except Exception as e:  # noqa: BLE001 - graceful degradation, the batch becomes '?'
+            if attempt < max_attempts - 1:
+                time.sleep(min(2 ** attempt, 20))
+                continue
+            return ["?"] * len(batch), {"ms": int((time.time() - t0) * 1000), "err": type(e).__name__, "missing": len(batch)}
+    return ["?"] * len(batch), {"ms": int((time.time() - t0) * 1000), "err": "max_attempts", "missing": len(batch)}
+
+
+if __name__ == "__main__":
+    # Manual sanity check: one small batch on the configured backend (default lmstudio).
+    import sys
+
+    probe = sys.argv[1:] or ["13 X ITP", "INLOCUIT PLACUTE FRANA FATA", "SCHIMB ULEI MOTOR SI FILTRE"]
+    b = get_backend()
+    print(f"backend={b.name} url={b.url} model={b.model}")
+    codes, meta = call(probe, b)
+    for op, c in zip(probe, codes):
+        print(f"  {c:6} {op}")
+    print("meta:", meta)
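Note on the truncation guard (F8) in `parseaza_raspuns`: the snippet below is a minimal standalone sketch of that behavior, not the module itself. It re-implements the guard under a hypothetical name (`parse_with_guard`) with a reduced, assumed label set (`VALID`) so that it runs without `or_common` on the path.

```python
# Illustrative sketch only: reduced label set and a re-implementation of the
# guard, so the snippet runs without or_common / eticheteaza being importable.
VALID = {"OE-1", "OE-3", "NUL"}  # assumption: a subset of the real 19 labels


def parse_with_guard(content: dict, n: int) -> list[str]:
    """Return n codes; missing positions or out-of-enum codes become '?'."""
    by_i: dict[int, str] = {}
    for item in content.get("rez") or []:
        try:
            idx = int(item["i"])
        except (KeyError, TypeError, ValueError):
            continue  # malformed entry: its position stays unanswered
        cod = str(item.get("cod") or "").strip().upper()
        by_i[idx] = cod if cod in VALID else "?"
    # Items are numbered from 1 in the user message, hence i + 1.
    return [by_i.get(i + 1, "?") for i in range(n)]


# A 3-item batch: item 1 answered (lowercase), item 2 got an invented code,
# and the response was cut off before item 3.
truncated = {"rez": [{"i": 1, "cod": "nul"}, {"i": 2, "cod": "OE-99"}]}
codes = parse_with_guard(truncated, 3)
print(codes)
```

Because unanswered positions surface as '?' rather than being dropped, the caller can count them (`missing` in `meta`) and leave those items unlabeled for a later run.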
--- a/tools/mapare-llm/genereaza_seed.py
+++ b/tools/mapare-llm/genereaza_seed.py
@@ -0,0 +1,344 @@
+"""Generate the labeled operation->code seed (US-003, PRD 5.18).
+
+Produces the artifact `app/data/operatii-etichetate.json` (committed to the repo), consumed
+by the seeder (US-004) and by the embeddings corpus (US-005). It does NOT call the LLM at
+runtime: labeling runs once, offline, on LM Studio (default backend, D4).
+
+MANDATORY dedup pipeline, in order, BEFORE any LLM call (D5):
+  1. Aggregate the N CSVs -> freq per RAW name (non-numeric NR -> skip the row, F9).
+  2. `cheie = normalize_for_match(denumire)` (the SAME function as DB/k-NN, NOT an exact strip).
+     Drop rows with `cheie == ""` before dedup (collision on the empty UNIQUE slot, F6).
+  3. Dedup by key: one representative per key, `freq = sum of NR`.
+  4. Map `cheie -> cod` from ALL existing labels: `labels-groq-partial.json` (keyed by raw
+     text) + the previously committed seed (keyed normalized). Conflict (same key, different
+     codes across raw variants) -> the code with max freq wins, tie-break on sorted code (F3).
+  5. `de_etichetat = corpus(within threshold) - map`. Sorted desc by freq = the ONLY LLM input.
+
+Cross-run idempotence (F2/F7): the committed seed = label cache -> re-run = 0 LLM calls.
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+import glob
+import importlib.util
+import json
+import os
+import sys
+from collections import Counter, defaultdict
+
+# The normalization function = single source of truth (consistent with DB/k-NN).
+_APP_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
+if _APP_ROOT not in sys.path:
+    sys.path.insert(0, _APP_ROOT)
+from app.mapping import normalize_for_match  # noqa: E402
+
+
+def _load_eticheteaza():
+    path = os.path.join(os.path.dirname(__file__), "eticheteaza.py")
+    spec = importlib.util.spec_from_file_location("eticheteaza", path)
+    mod = importlib.util.module_from_spec(spec)
+    sys.modules.setdefault("eticheteaza", mod)
+    spec.loader.exec_module(mod)
+    return mod
+
+
+# Default paths (relative to the repo).
+DEFAULT_CSV_GLOB = os.path.join(_APP_ROOT, "docs", "operatii-service", "*.csv")
+DEFAULT_LABELS = os.path.join(_APP_ROOT, "tools", "mapare-llm", "labels-groq-partial.json")
+DEFAULT_SEED = os.path.join(_APP_ROOT, "app", "data", "operatii-etichetate.json")
+
+NUL_LABEL = "NUL"
+DEFAULT_CONFIDENCE = 0.7
+DEFAULT_SOURCE = "llm_seed"
+
+
+# --------------------------------------------------------------------------- #
+# Steps 1-3: corpus aggregated by normalized key                              #
+# --------------------------------------------------------------------------- #
+
+def _freq_raw(csv_paths: list[str]) -> Counter:
+    """Counter raw_name -> sum of NR. Non-numeric NR -> skip the row (F9), not zero-weight."""
+    freq: Counter = Counter()
+    for f in csv_paths:
+        with open(f, encoding="utf-8", errors="replace") as fh:
+            for r in list(csv.reader(fh, delimiter=";"))[1:]:
+                if len(r) <= 2:
+                    continue
+                den = r[1].strip()
+                if not den:
+                    continue
+                nr_raw = (r[2] or "").strip()
+                try:
+                    nr = int(nr_raw)
+                except ValueError:
+                    continue  # F9: skip rows with non-numeric NR
+                freq[den] += nr
+    return freq
+
+
+def _corpus_din_freq(freq_raw: Counter) -> dict[str, dict]:
+    """{normalized_key -> {denumire, freq}}. Drops empty keys (F6).
+
+    `denumire` = the raw variant with the highest individual freq (tie-break: raw sorted asc),
+    used as the text sent to the LLM and stored in the seed.
+    """
+    grup: dict[str, list[tuple[str, int]]] = defaultdict(list)
+    for raw, n in freq_raw.items():
+        cheie = normalize_for_match(raw)
+        if not cheie:
+            continue  # F6
+        grup[cheie].append((raw, n))
+
+    corpus: dict[str, dict] = {}
+    for cheie, variante in grup.items():
+        freq = sum(n for _, n in variante)
+        # deterministic representative: max freq, tie-break on sorted raw.
+        denumire = sorted(variante, key=lambda rn: (-rn[1], rn[0]))[0][0]
+        corpus[cheie] = {"denumire": denumire, "freq": freq}
+    return corpus
+
+
+def agrega_corpus(csv_paths: list[str]) -> dict[str, dict]:
+    """{normalized_key -> {denumire, freq}} from the CSVs (steps 1-3)."""
+    return _corpus_din_freq(_freq_raw(csv_paths))
+
+
+# --------------------------------------------------------------------------- #
+# Step 4: key -> code map from existing labels (reuse + conflict)             #
+# --------------------------------------------------------------------------- #
+
+def _incarca_seed(seed_path: str | None) -> list[dict]:
+    if not seed_path or not os.path.exists(seed_path):
+        return []
+    try:
+        return json.loads(open(seed_path, encoding="utf-8").read())
+    except (ValueError, OSError):
+        return []
+
+
+def construieste_harta_etichete(
+    freq_raw: Counter,
+    corpus: dict[str, dict],
+    labels_path: str | None,
+    seed_existent: list[dict],
+) -> dict[str, str]:
+    """Map normalized_key -> label (RAR code or 'NUL'), reused in normalized space.
+
+    Freq-weighted votes; conflict on the same key -> max freq, tie-break on sorted code (F3).
+    """
+    votes: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
+
+    # labels-groq-partial.json: keyed by RAW text.
+    if labels_path and os.path.exists(labels_path):
+        labels = json.loads(open(labels_path, encoding="utf-8").read())
+        for raw, cod in labels.items():
+            cheie = normalize_for_match(raw)
+            if not cheie:
+                continue
+            cod = str(cod or "").strip().upper()
+            if not cod:
+                continue
+            votes[cheie][cod] += freq_raw.get(raw, 0)
+
+    # previously committed seed: keyed normalized (cross-run cache).
+    for e in seed_existent:
+        cheie = e.get("denumire_normalizata")
+        if not cheie:
+            continue
+        eticheta = NUL_LABEL if e.get("is_nul") else str(e.get("cod") or "").strip().upper()
+        if not eticheta:
+            continue
+        votes[cheie][eticheta] += corpus.get(cheie, {}).get("freq", 0)
+
+    harta: dict[str, str] = {}
+    for cheie, codmap in votes.items():
+        # freq desc, then code asc -> deterministic.
+        harta[cheie] = sorted(codmap.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
+    return harta
+
+
+# --------------------------------------------------------------------------- #
+# Step 5: select de_etichetat (volume threshold) + orchestration              #
+# --------------------------------------------------------------------------- #
+
+def selecteaza_de_etichetat(
+    corpus: dict[str, dict],
+    harta: dict[str, str],
+    *,
+    target_volum: float,
+    etichetare_all: bool,
+) -> list[str]:
+    """Unlabeled keys, sorted desc by freq, within the volume threshold."""
+    ordered = sorted(corpus, key=lambda k: (-corpus[k]["freq"], k))
+    if etichetare_all:
+        in_prag = ordered
+    else:
+        total = sum(c["freq"] for c in corpus.values()) or 1
+        in_prag = []
+        cum = 0
+        for k in ordered:
+            in_prag.append(k)
+            cum += corpus[k]["freq"]
+            if cum / total >= target_volum:
+                break
+    return [k for k in in_prag if k not in harta]
+
+
+def genereaza(
+    csv_paths: list[str],
+    *,
+    labels_path: str | None = DEFAULT_LABELS,
+    seed_path: str = DEFAULT_SEED,
+    target_volum: float = 0.9,
+    etichetare_all: bool = False,
+    clasifica=None,
+    batch: int = 32,
+    confidence: float = DEFAULT_CONFIDENCE,
+    source: str = DEFAULT_SOURCE,
+    progres=None,
+    checkpoint_every: int = 1,
+    pauza: float = 0.0,
+) -> dict:
+    """Generate/update the seed. Returns statistics. Writes `seed_path`.
+
+    `clasifica(batch_denumiri) -> list[cod]` is injectable (tests); default = LM Studio.
+    `progres(mesaj)` is an optional logging callback.
+
+    Checkpointing (every `checkpoint_every` batches): the seed is written to disk periodically
+    during the run, NOT only at the end. Essential on the unstable GPU box (thermal shutdown
+    under load, memory note lmstudio-gpu-etichetare): a crash at batch 80/104 keeps the
+    progress, and the re-run resumes from the cache (cross-run idempotence). 0 = only at the end.
+    """
+    freq_raw = _freq_raw(csv_paths)
+    corpus = _corpus_din_freq(freq_raw)
+    seed_existent = _incarca_seed(seed_path)
+    harta = construieste_harta_etichete(freq_raw, corpus, labels_path, seed_existent)
+    de_etichetat = selecteaza_de_etichetat(
+        corpus, harta, target_volum=target_volum, etichetare_all=etichetare_all
+    )
+    reused = len(harta)
+
+    brute = int(sum(freq_raw.values()))
+    if progres:
+        progres(f"{len(freq_raw)} randuri brute distincte -> {len(corpus)} dupa normalizare "
+                f"-> {len(de_etichetat)} trimise la LLM (deja: {len(harta)})")
+
+    clasif = clasifica
+    if clasif is None:
+        et = _load_eticheteaza()
+        backend = et.get_backend()
+        if progres:
+            progres(f"backend={backend.name} url={backend.url} model={backend.model}")
+
+        def clasif(batch_denumiri):
+            return et.call(batch_denumiri, backend)[0]
+
+    apeluri = 0
+    valide = _valid_labels()
+    nr_batch = (len(de_etichetat) + batch - 1) // batch
+    for k in range(0, len(de_etichetat), batch):
+        chunk = de_etichetat[k:k + batch]
+        denumiri = [corpus[c]["denumire"] for c in chunk]
+        codes = clasif(denumiri)
+        apeluri += 1
+        for cheie, cod in zip(chunk, codes):
+            cod = str(cod or "").strip().upper()
+            if cod in valide:  # '?' / invalid code -> stays unlabeled (retried on the next run)
+                harta[cheie] = cod
+        if progres:
+            progres(f"  batch {apeluri}/{nr_batch} "
+                    f"-> total etichetat {sum(1 for c in harta if c in corpus)}")
+        # Periodic checkpoint: protects progress on the unstable GPU box.
+        if checkpoint_every and apeluri % checkpoint_every == 0:
+            _scrie_seed(seed_path, _construieste_seed(corpus, harta, confidence=confidence, source=source))
+        # Pause between batches: thermal breathing room for the GPU box (shutdown under sustained load).
+        if pauza and k + batch < len(de_etichetat):
+            import time as _t
+            _t.sleep(pauza)
+
+    seed = _construieste_seed(corpus, harta, confidence=confidence, source=source)
+    _scrie_seed(seed_path, seed)
+    return {
+        "brute": brute,
+        "distincte": len(corpus),
+        "deja_etichetate": reused,
+        "de_etichetat": len(de_etichetat),
+        "apeluri_llm": apeluri,
+        "seed": len(seed),
+    }
+
+
+def _valid_labels() -> set[str]:
+    et = _load_eticheteaza()
+    return set(et.ALL_LABELS)
+
+
+def _construieste_seed(corpus, harta, *, confidence, source) -> list[dict]:
+    """Seed in deterministic order (by key) -> byte-stable across runs."""
+    out = []
+    for cheie in sorted(harta):
+        if cheie not in corpus:
+            continue  # label with no match in the current corpus
+        eticheta = harta[cheie]
+        is_nul = eticheta == NUL_LABEL
+        out.append({
+            "denumire": corpus[cheie]["denumire"],
+            "denumire_normalizata": cheie,
+            "cod": None if is_nul else eticheta,
+            "is_nul": is_nul,
+            "source": source,
+            "confidence": confidence,
+        })
+    return out
+
+
+def _scrie_seed(seed_path: str, seed: list[dict]) -> None:
+    os.makedirs(os.path.dirname(os.path.abspath(seed_path)), exist_ok=True)
+    with open(seed_path, "w", encoding="utf-8") as fh:
+        json.dump(seed, fh, ensure_ascii=False, indent=2)
+        fh.write("\n")
+
+
+# --------------------------------------------------------------------------- #
+# CLI                                                                          #
+# --------------------------------------------------------------------------- #
+
+def main(argv=None):
+    ap = argparse.ArgumentParser(description="Generate the labeled operation->code seed (LM Studio).")
+    ap.add_argument("--target-volum", type=float, default=0.9,
+                    help="volume coverage threshold (default 0.9 = D1)")
+    ap.add_argument("--all", action="store_true", help="label the whole corpus, ignore the threshold")
+    ap.add_argument("--batch", type=int, default=32, help="batch size (conservative: 32-40)")
+    ap.add_argument("--pauza", type=float, default=1.5,
+                    help="seconds to pause between batches (GPU thermal breathing room); 0 = none")
+    ap.add_argument("--checkpoint-every", type=int, default=1,
+                    help="write the seed every N batches (1 = after each one, crash-safe)")
+    ap.add_argument("--confidence", type=float, default=DEFAULT_CONFIDENCE)
+    ap.add_argument("--csv-glob", default=DEFAULT_CSV_GLOB)
+    ap.add_argument("--labels", default=DEFAULT_LABELS)
+    ap.add_argument("--seed", default=DEFAULT_SEED)
+    args = ap.parse_args(argv)
+
+    csv_paths = sorted(glob.glob(args.csv_glob))
+    if not csv_paths:
+        ap.error(f"niciun CSV gasit la {args.csv_glob}")
+
+    stats = genereaza(
+        csv_paths,
+        labels_path=args.labels,
+        seed_path=args.seed,
+        target_volum=args.target_volum,
+        etichetare_all=args.all,
+        batch=args.batch,
+        pauza=args.pauza,
+        checkpoint_every=args.checkpoint_every,
+        confidence=args.confidence,
+        progres=lambda m: print(m, flush=True),
+    )
+    print("GATA:", json.dumps(stats, ensure_ascii=False))
+
+
+if __name__ == "__main__":
+    main()
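Note on steps 4 and 5 of the dedup pipeline: the worked example below is a rough sketch of the freq-weighted vote and the volume-threshold selection, using toy data. All names and numbers in it (`corpus_freq`, `votes_in`, `select_unlabeled`) are hypothetical and stand in for the real corpus, label files and functions; the keys are assumed to be already normalized.

```python
from collections import defaultdict

# Hypothetical toy corpus: normalized key -> total frequency (steps 1-3 done).
corpus_freq = {"schimb ulei": 60, "placute frana": 25, "itp": 10, "reglaj faruri": 5}

# Hypothetical existing labels as (key, code, weight) votes.
# "schimb ulei" has a conflict between two codes.
votes_in = [("schimb ulei", "OE-3", 40), ("schimb ulei", "OE-1", 20), ("itp", "NUL", 10)]

# Step 4: freq-weighted vote per key; highest weight wins, ties go to code asc.
votes: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for key, code, weight in votes_in:
    votes[key][code] += weight
label_map = {
    key: sorted(codes.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]
    for key, codes in votes.items()
}


def select_unlabeled(freqs: dict[str, int], labeled: dict[str, str], target: float) -> list[str]:
    """Step 5: keys inside the volume threshold that still have no label."""
    ordered = sorted(freqs, key=lambda k: (-freqs[k], k))
    total = sum(freqs.values()) or 1
    within, cum = [], 0
    for k in ordered:
        within.append(k)
        cum += freqs[k]
        if cum / total >= target:  # threshold reached: stop adding keys
            break
    return [k for k in within if k not in labeled]


to_label = select_unlabeled(corpus_freq, label_map, target=0.9)
print(label_map)
print(to_label)
```

With a 0.9 target the three most frequent keys cover 95% of the volume, so the rarest key is never sent to the LLM, and the two keys already present in the map are skipped.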