echo-core/tools/tts.py
Marius Mutu d1bc77e87d feat(voice): polish voice loop UX — filler kill, barge-in, DTX flush, time/RO TTS
End-to-end voice UX iteration after DAVE E2E shipped. Each change addresses a
real symptom Marius hit in live testing today:

- Kill the 3s filler ("mă gândesc", Romanian for "I'm thinking"): Claude p50
  is 4-7s so the filler always fired BEFORE the response and collided with it.
  Removed all filler infra from pipeline.py + tts_stream.py (FILLER_DELAY_S,
  _filler_task, push_filler, load_thinking_wav, thinking.wav cache).

- Barge-in: ttsq.clear() at the top of on_segment_done drops stale frames so
  a new utterance cuts off Echo's previous response cleanly.

- DTX silence flush: Discord stops sending RTP packets when the user goes
  silent (DTX), so the inline silence-check in sink.write() never fired for
  the trailing audio of an utterance — STT was missed entirely. Added a
  background poller thread that checks the silence-flush condition every
  200ms independent of incoming packets.

- Discord audio cadence fix: EchoStreamingAudioSource.read() blocked 100ms
  per call when pcm_queue was empty, wrecking Discord's 20ms frame pacing →
  client interpreted the stream as stutter and discarded leading frames
  (Marius heard "4 de minute în București" instead of the full sentence).
  Switched to get_frame_nowait() — instant return, silence frame on empty.

- RO time expansion: "23:09" was being read as "douăzeci și trei:nouă"
  ("twenty-three:nine") with the colon left literal. Added expand_time() with
  feminine-correct minute formatting (un minut / două minute / douăzeci de
  minute / una de minute).

- Supertonic Unicode sanitize centralized in tools/tts.py: Romanian curly
  quotes and other smart punctuation (`„`, `“`, `”`, `—`, `…`) crash Supertonic
  with HTTP 500. Map them to ASCII at the synthesize() entry so BOTH voice mode
  and /audio command are covered without duplication. normalize.py re-exports
  for compat.

- Whisper offline: WhisperModel(..., local_files_only=True) — no more
  huggingface.co metadata GET on every startup. Model is already cached.

- Diagnostic logging across the chain: sink first-packet, VAD first-speech,
  voice stream block (Claude → callback), push_text (text → clauses queued),
  TTS pushed (clauses → frames). Lets future "spoke but Echo silent" bugs
  pinpoint exactly where the chain breaks.

- Captured Supertonic curly-quote lesson in tasks/lessons.md.
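
The DTX silence flush bullet above can be sketched as a small poller thread. This is a minimal illustration, not the code in the sink: the class name `SilenceFlushPoller`, the `on_flush` callback, and the 0.6 s silence threshold are hypothetical. Only the 200ms poll interval comes from the commit message.

```python
import threading
import time


class SilenceFlushPoller:
    """Flush buffered audio once no packet has arrived for `silence_s` seconds.

    Hypothetical sketch: the check runs on a timer thread, so it still fires
    when the sender stops transmitting packets entirely (DTX).
    """

    def __init__(self, on_flush, silence_s=0.6, poll_s=0.2):
        self._on_flush = on_flush
        self._silence_s = silence_s
        self._poll_s = poll_s
        self._lock = threading.Lock()
        self._buffer = bytearray()
        self._last_packet = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def write(self, pcm: bytes):
        """Called for every incoming packet; records the arrival time."""
        with self._lock:
            self._buffer.extend(pcm)
            self._last_packet = time.monotonic()

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._poll_s):
            with self._lock:
                idle = time.monotonic() - self._last_packet
                if not self._buffer or idle < self._silence_s:
                    continue
                segment = bytes(self._buffer)
                self._buffer.clear()
            # Run the callback outside the lock so write() is never blocked by it.
            self._on_flush(segment)
```

The point of the design is that the flush condition no longer depends on `write()` being called, which is exactly the case DTX breaks.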

All 76 voice tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:33:24 +00:00
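
The cadence fix in the commit message above comes down to never blocking inside `read()`. The sketch below shows that pattern under stated assumptions: `NonBlockingPcmSource` and `push()` are hypothetical names, and a plain `queue.Queue` stands in for the project's own queue and its `get_frame_nowait()` method. The frame size assumes Discord's 48 kHz, 16-bit, stereo PCM in 20 ms frames.

```python
import queue

# 20 ms of 48 kHz, 16-bit, stereo PCM: 960 samples * 2 channels * 2 bytes.
FRAME_BYTES = 960 * 2 * 2
SILENCE_FRAME = b"\x00" * FRAME_BYTES


class NonBlockingPcmSource:
    """Hypothetical audio source whose read() returns immediately."""

    def __init__(self):
        self.pcm_queue = queue.Queue()

    def push(self, frame: bytes):
        """Producer side: enqueue one ready-to-send PCM frame."""
        self.pcm_queue.put(frame)

    def read(self) -> bytes:
        """Return the next frame, or a silence frame if none is queued."""
        try:
            return self.pcm_queue.get_nowait()
        except queue.Empty:
            return SILENCE_FRAME
```

Returning silence keeps the 20 ms pacing intact while TTS frames are still being synthesized, instead of stalling the sender for the length of a blocking timeout.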


#!/usr/bin/env python3
"""Text-to-speech via Supertonic local server.

CLI:
    python3 tools/tts.py --text "Salut Marius" [--voice M1] [--lang ro]
    → stdout: {"ok": true, "path": "/tmp/echo-tts-xxx.wav", "size_bytes": 12345}
    → stdout: {"ok": false, "error": "..."}

Module:
    from tools.tts import synthesize
    result = synthesize("text", voice="M1", lang="ro")
"""
import argparse
import json
import sys
import tempfile

import httpx

SUPERTONIC_URL = "http://127.0.0.1:7788"
VOICES = {"M1", "M2", "M3", "M4", "M5", "F1", "F2", "F3", "F4", "F5"}
DEFAULT_VOICE = "M2"
DEFAULT_LANG = "ro"
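
# Punctuation Supertonic synthesis rejects with HTTP 500 (Romanian curly quotes,
# smart dashes, ellipsis, angle quotes). Mapped to ASCII so a stray „foo” in
# any caller's text doesn't kill the whole request.
# Keys are written as escapes; the single-quote and dash code points are
# assumed from their ASCII targets.
_TTS_PUNCT_MAP = {
    '\u201e': '"', '\u201c': '"', '\u201d': '"',  # low-9, left, right double quotes
    '\u201a': "'", '\u2018': "'", '\u2019': "'",  # low-9, left, right single quotes
    '«': '"', '»': '"',                           # angle quotes
    '\u2013': '-', '\u2014': '-',                 # en dash, em dash
    '\u2026': '...',                              # horizontal ellipsis
}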


def sanitize_for_supertonic(text: str) -> str:
    """Replace Unicode punctuation Supertonic rejects with ASCII equivalents."""
    for src, dst in _TTS_PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text


def synthesize(text: str, voice: str = DEFAULT_VOICE, lang: str = DEFAULT_LANG) -> dict:
    """Call Supertonic server and save audio to a temp WAV file.

    Returns:
        {"ok": True, "path": "/tmp/echo-tts-xxx.wav", "size_bytes": N}
        {"ok": False, "error": "error message"}
    """
    if not text or not text.strip():
        return {"ok": False, "error": "Text gol."}

    text = sanitize_for_supertonic(text)

    # Unknown voices silently fall back to the default.
    voice = voice.upper()
    if voice not in VOICES:
        voice = DEFAULT_VOICE

    try:
        resp = httpx.post(
            f"{SUPERTONIC_URL}/v1/audio/speech",
            json={
                "model": "supertonic-3",
                "input": text,
                "voice": voice,
                "response_format": "wav",
                "lang": lang,
            },
            timeout=60.0,
        )
        resp.raise_for_status()
    except httpx.ConnectError:
        return {
            "ok": False,
            "error": (
                "Serverul Supertonic nu rulează pe :7788. "
                "Pornește cu: systemctl --user start supertonic-tts"
            ),
        }
    except httpx.HTTPStatusError as e:
        body = e.response.text[:300]
        # Fallback: if the requested lang (e.g. ro) fails, retry once with
        # na (language-agnostic).
        if lang != "na":
            return synthesize(text, voice=voice, lang="na")
        return {"ok": False, "error": f"HTTP {e.response.status_code}: {body}"}
    except Exception as e:
        return {"ok": False, "error": str(e)}

    # Save the audio to a temp file.
    try:
        fd, path = tempfile.mkstemp(prefix="echo-tts-", suffix=".wav")
        with open(fd, "wb") as f:
            f.write(resp.content)
        return {"ok": True, "path": path, "size_bytes": len(resp.content)}
    except Exception as e:
        return {"ok": False, "error": f"Scriere fișier: {e}"}


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Supertonic TTS CLI")
    parser.add_argument("--text", required=True, help="Text de convertit în audio")
    parser.add_argument(
        "--voice", default=DEFAULT_VOICE,
        help="Voce: M1-M5 (masculin) sau F1-F5 (feminin). Default: M2"
    )
    parser.add_argument(
        "--lang", default=DEFAULT_LANG,
        help="Limbă (ro, en, na). Default: ro. Fallback automat la na dacă ro eșuează."
    )
    args = parser.parse_args()
    result = synthesize(args.text, voice=args.voice, lang=args.lang)
    print(json.dumps(result, ensure_ascii=False))
    sys.exit(0 if result.get("ok") else 1)
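
To see what the sanitizer in the file above does to a typical Romanian string, here is a self-contained sketch. It copies the replace loop from `sanitize_for_supertonic()` but uses a reduced, hypothetical map (`DEMO_PUNCT_MAP`) with explicit code points, so it runs without httpx or a Supertonic server.

```python
# Reduced, illustrative map; the module's _TTS_PUNCT_MAP covers more characters.
DEMO_PUNCT_MAP = {
    "\u201e": '"',    # low-9 double quote (Romanian opening quote)
    "\u201d": '"',    # right double quote (Romanian closing quote)
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}


def demo_sanitize(text: str) -> str:
    """Same replace loop as sanitize_for_supertonic(), over the demo map."""
    for src, dst in DEMO_PUNCT_MAP.items():
        text = text.replace(src, dst)
    return text


sample = "\u201eBun\u0103 ziua\u201d \u2014 ora 23:09\u2026"
print(demo_sanitize(sample))  # "Bună ziua" - ora 23:09...
```

Romanian letters such as `ă` pass through untouched; only the punctuation listed in the map is rewritten.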