feat(voice): voice-mode prompt, isolated session, units, verbal voice swap, fast barge-in
Second voice UX iteration. Targets Marius's live-test pain points from today.
- **Voice-mode system prompt** (personality/VOICE_MODE.md, plumbed via
claude_session.build_system_prompt(voice_mode=True)) — when the voice
adapter starts a session, append voice-tailored instructions: short replies,
no markdown, no abbreviations, time without seconds, distances rounded
to "mii"/"milioane", no curly quotes / em-dash / ellipsis. Marius asked
for a "in-the-car friend" persona for voice.
- **Isolated voice session key** (router.py) — voice mode uses
`voice:<channel_id>` so it doesn't share context with the text adapter
on the same Discord channel. Fresh start, voice prompt applied
automatically without `/clear` ceremony. `/clear` drops both keys.
- **Metric units + Romanian thousands** (src/voice/normalize.py) —
`384.000 km` was being read as "trei sute optzeci și patru virgulă zero
zero zero km" because the dot was treated as decimal separator and `km`
wasn't expanded. New `normalize_thousands` collapses Romanian thousands
separators (`X.000`/`X.000.000`) before number expansion, and
`expand_units` handles km/kg/cm/mm/ml/ha/mp with correct Romanian
pluralization ("un kilometru", "două kilograme", "douăzeci de
centimetri", "o sută de kilometri" with "de" particle).
- **`/voice setvoice <M1-F5>` slash command** (discord_voice.py) — Discord
native autocomplete; swaps the live TTSQueue voice_id AND persists
voice.default_voice to config.json. No restart needed.
- **Verbal voice change** (src/voice/voice_commands.py — new module +
29 tests) — say "schimbă vocea pe M5" / "vorbește cu vocea F3" / "voce
em cinci" from inside the voice channel. Detector requires both a
trigger word (voce/vorbește/schimbă/treci pe) and a recognizable voice
ID (direct "M5", word form "em cinci", or fallback substring match for
Whisper-mangled forms like "unul cinci"=M5 and "Mâcinci"=M5). On
detection: live-swap, persist to config, mirror to chat with
`🎤 ... / 🔊 Voce → M5`, speak short ack in the NEW voice, skip
Claude. "pământinci" still can't be recovered (no recoverable digit
substring); user gets passthrough to Claude in that case.
- **Whisper initial_prompt** now lists the voice-command vocabulary so
STT biases toward producing clean "M5" / "F3" tokens instead of
inventing "pământ" / "unul" phonetic neighbors.
- **Fast barge-in** (pipeline.py EchoVoiceSink) — previously `ttsq.clear()`
only fired in `on_segment_done` (after 800ms silence + 2-3s STT ≈ 3s lag).
Now also fires from the sink as soon as VAD detects ≥2 consecutive
windows (~200ms) of sustained speech on Marius's user while Echo has
pending TTS frames. Single-window glitches don't cut Echo off; sustained
speech does. (Acoustic echo bleed-through still requires headphones —
no AEC in the bot.)
- Tests: 130 voice + router tests pass; updated test_router.py to expect
`/clear` to drop both text and voice session keys.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -105,9 +105,11 @@
|
|||||||
"url": "http://10.0.20.161:11434"
|
"url": "http://10.0.20.161:11434"
|
||||||
},
|
},
|
||||||
"voice": {
|
"voice": {
|
||||||
"allowed_user_ids": ["949388626146517022"],
|
"allowed_user_ids": [
|
||||||
|
"949388626146517022"
|
||||||
|
],
|
||||||
"user_name": "Marius",
|
"user_name": "Marius",
|
||||||
"default_voice": "M2",
|
"default_voice": "M5",
|
||||||
"auto_leave_minutes": 5
|
"auto_leave_minutes": 5
|
||||||
},
|
},
|
||||||
"paths": {
|
"paths": {
|
||||||
|
|||||||
30
personality/VOICE_MODE.md
Normal file
30
personality/VOICE_MODE.md
Normal file
@@ -0,0 +1,30 @@
|
|||||||
|
# Voice Mode
|
||||||
|
|
||||||
|
Răspunzi prin voce (TTS). Marius te aude — nu citește. Reguli care contează:
|
||||||
|
|
||||||
|
## Lungime și ton
|
||||||
|
|
||||||
|
- **Scurt**: 1-2 propoziții, max ~30 cuvinte per turn. Marius vorbește cu tine — nu redactezi un document.
|
||||||
|
- **Conversațional**: ca un om viu. Fără "Sigur, iată...", "Permite-mi să...", "Te rog să...". Direct la subiect.
|
||||||
|
- **Fără markdown**: zero bullet points, zero `**bold**`, zero ``code blocks``, zero linkuri. Totul e citit cu voce.
|
||||||
|
|
||||||
|
## Numere și unități
|
||||||
|
|
||||||
|
- **Ora**: fără secunde. Spune "ora 23 și 9 minute" sau "9 și jumătate", nu "23:09:42".
|
||||||
|
- **Distanțe mari**: rotunjește în "mii" sau "milioane". Pentru Pământ-Lună spune "384 mii de kilometri", nu "384.000 km".
|
||||||
|
- **Zecimale**: omite-le când nu adaugă informație. "5 lei" nu "5,00 lei". "două ore" nu "2,0 ore". "20 de minute" nu "20,5 minute".
|
||||||
|
- **Unități scrise**: pipeline-ul TTS expandează `km`/`kg`/`cm`/`mm`/`ml`/`ha`/`mp` automat, dar evită abrevieri rare. Scrie "metri" nu "m." dacă e ambiguu.
|
||||||
|
|
||||||
|
## Structură
|
||||||
|
|
||||||
|
- Listă scurtă verbală: "Trei lucruri: întâi X, apoi Y, plus Z."
|
||||||
|
- Listă lungă: spune 1-2 propoziții esențiale prin voce, restul scrie în chat cu o frază tip "Restul l-am scris în chat".
|
||||||
|
- Întrebări clarificatoare: pune UNA, nu trei.
|
||||||
|
|
||||||
|
## Punctuație
|
||||||
|
|
||||||
|
- Doar virgule și puncte. Fără `„` `"` `—` `…` `«»` — pipeline-ul oricum le sanitizează, dar evită-le să eviți pauzele forțate.
|
||||||
|
|
||||||
|
## Tu ești Marius's prieten în mașină
|
||||||
|
|
||||||
|
Imaginează-ți că Marius conduce și te-a întrebat ceva pe difuzor. Răspunzi natural, scurt, la subiect — fără ceremonii.
|
||||||
@@ -246,6 +246,45 @@ def register(tree: app_commands.CommandTree, bot: discord.Client) -> app_command
|
|||||||
log.warning("Presence reset skipped", exc_info=True)
|
log.warning("Presence reset skipped", exc_info=True)
|
||||||
await interaction.followup.send("Plecat.", ephemeral=True)
|
await interaction.followup.send("Plecat.", ephemeral=True)
|
||||||
|
|
||||||
|
_VOICE_CHOICES = [
|
||||||
|
app_commands.Choice(name=v, value=v)
|
||||||
|
for v in ("M1", "M2", "M3", "M4", "M5", "F1", "F2", "F3", "F4", "F5")
|
||||||
|
]
|
||||||
|
|
||||||
|
@voice_group.command(name="setvoice", description="Schimbă vocea Echo (M1-M5 sau F1-F5)")
|
||||||
|
@app_commands.describe(voice="Voce nouă")
|
||||||
|
@app_commands.choices(voice=_VOICE_CHOICES)
|
||||||
|
async def setvoice(
|
||||||
|
interaction: discord.Interaction,
|
||||||
|
voice: app_commands.Choice[str],
|
||||||
|
) -> None:
|
||||||
|
await interaction.response.defer(ephemeral=True)
|
||||||
|
new_voice = voice.value
|
||||||
|
# Live-swap on the active session if Echo is in voice on this guild.
|
||||||
|
guild_id = interaction.guild.id if interaction.guild else None
|
||||||
|
session = _voice_sessions.get(guild_id) if guild_id is not None else None
|
||||||
|
live_swapped = False
|
||||||
|
if session is not None and session.ttsq is not None:
|
||||||
|
session.ttsq.voice_id = new_voice
|
||||||
|
live_swapped = True
|
||||||
|
# Persist as the new default for future sessions.
|
||||||
|
try:
|
||||||
|
cfg = Config()
|
||||||
|
cfg.set("voice.default_voice", new_voice)
|
||||||
|
cfg.save()
|
||||||
|
except Exception as e:
|
||||||
|
log.warning("config save failed for new default voice: %s", e)
|
||||||
|
await interaction.followup.send(
|
||||||
|
f"Voce schimbată live ({new_voice}), dar config-ul nu s-a salvat: {e}",
|
||||||
|
ephemeral=True,
|
||||||
|
)
|
||||||
|
return
|
||||||
|
if live_swapped:
|
||||||
|
msg = f"Vocea schimbată **live** pe {new_voice}. Următoarea frază va folosi vocea nouă."
|
||||||
|
else:
|
||||||
|
msg = f"Default voce setată {new_voice}. Va intra în vigoare la următorul /voice join."
|
||||||
|
await interaction.followup.send(msg, ephemeral=True)
|
||||||
|
|
||||||
@voice_group.command(name="doctor", description="Verifică voice stack")
|
@voice_group.command(name="doctor", description="Verifică voice stack")
|
||||||
async def doctor(interaction: discord.Interaction) -> None:
|
async def doctor(interaction: discord.Interaction) -> None:
|
||||||
await interaction.response.defer(ephemeral=True)
|
await interaction.response.defer(ephemeral=True)
|
||||||
|
|||||||
@@ -399,15 +399,23 @@ def _run_claude(
|
|||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
def build_system_prompt() -> str:
|
def build_system_prompt(voice_mode: bool = False) -> str:
|
||||||
"""Concatenate personality/*.md files into a single system prompt."""
|
"""Concatenate personality/*.md files into a single system prompt.
|
||||||
|
|
||||||
|
When ``voice_mode=True``, appends ``VOICE_MODE.md`` so the model knows
|
||||||
|
its reply will be read aloud (terse, no markdown, no abbreviations, etc.).
|
||||||
|
"""
|
||||||
if not PERSONALITY_DIR.is_dir():
|
if not PERSONALITY_DIR.is_dir():
|
||||||
raise FileNotFoundError(
|
raise FileNotFoundError(
|
||||||
f"Personality directory not found: {PERSONALITY_DIR}"
|
f"Personality directory not found: {PERSONALITY_DIR}"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
files = list(PERSONALITY_FILES)
|
||||||
|
if voice_mode:
|
||||||
|
files.append("VOICE_MODE.md")
|
||||||
|
|
||||||
parts: list[str] = []
|
parts: list[str] = []
|
||||||
for filename in PERSONALITY_FILES:
|
for filename in files:
|
||||||
filepath = PERSONALITY_DIR / filename
|
filepath = PERSONALITY_DIR / filename
|
||||||
if filepath.is_file():
|
if filepath.is_file():
|
||||||
parts.append(filepath.read_text(encoding="utf-8"))
|
parts.append(filepath.read_text(encoding="utf-8"))
|
||||||
@@ -434,6 +442,7 @@ def start_session(
|
|||||||
model: str = DEFAULT_MODEL,
|
model: str = DEFAULT_MODEL,
|
||||||
timeout: int = DEFAULT_TIMEOUT,
|
timeout: int = DEFAULT_TIMEOUT,
|
||||||
on_text: Callable[[str], None] | None = None,
|
on_text: Callable[[str], None] | None = None,
|
||||||
|
voice_mode: bool = False,
|
||||||
) -> tuple[str, str]:
|
) -> tuple[str, str]:
|
||||||
"""Start a new Claude CLI session for a channel.
|
"""Start a new Claude CLI session for a channel.
|
||||||
|
|
||||||
@@ -441,13 +450,16 @@ def start_session(
|
|||||||
|
|
||||||
If *on_text* is provided, each intermediate Claude text block is passed
|
If *on_text* is provided, each intermediate Claude text block is passed
|
||||||
to the callback as soon as it arrives.
|
to the callback as soon as it arrives.
|
||||||
|
|
||||||
|
*voice_mode* — when True, ``VOICE_MODE.md`` is appended to the system
|
||||||
|
prompt so the model produces short, TTS-friendly responses.
|
||||||
"""
|
"""
|
||||||
if model not in VALID_MODELS:
|
if model not in VALID_MODELS:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
f"Invalid model '{model}'. Must be one of: haiku, sonnet, opus"
|
f"Invalid model '{model}'. Must be one of: haiku, sonnet, opus"
|
||||||
)
|
)
|
||||||
|
|
||||||
system_prompt = build_system_prompt()
|
system_prompt = build_system_prompt(voice_mode=voice_mode)
|
||||||
|
|
||||||
# Wrap external user message with injection protection markers
|
# Wrap external user message with injection protection markers
|
||||||
wrapped_message = f"[EXTERNAL CONTENT]\n{message}\n[END EXTERNAL CONTENT]"
|
wrapped_message = f"[EXTERNAL CONTENT]\n{message}\n[END EXTERNAL CONTENT]"
|
||||||
@@ -578,6 +590,7 @@ def send_message(
|
|||||||
model: str = DEFAULT_MODEL,
|
model: str = DEFAULT_MODEL,
|
||||||
timeout: int = DEFAULT_TIMEOUT,
|
timeout: int = DEFAULT_TIMEOUT,
|
||||||
on_text: Callable[[str], None] | None = None,
|
on_text: Callable[[str], None] | None = None,
|
||||||
|
voice_mode: bool = False,
|
||||||
) -> str:
|
) -> str:
|
||||||
"""High-level convenience: auto start or resume based on channel state.
|
"""High-level convenience: auto start or resume based on channel state.
|
||||||
|
|
||||||
@@ -598,7 +611,8 @@ def send_message(
|
|||||||
if session is not None and session.get("model"):
|
if session is not None and session.get("model"):
|
||||||
effective_model = session["model"]
|
effective_model = session["model"]
|
||||||
response_text, _session_id = start_session(
|
response_text, _session_id = start_session(
|
||||||
channel_id, message, effective_model, timeout, on_text=on_text
|
channel_id, message, effective_model, timeout,
|
||||||
|
on_text=on_text, voice_mode=voice_mode,
|
||||||
)
|
)
|
||||||
return response_text
|
return response_text
|
||||||
|
|
||||||
|
|||||||
@@ -123,8 +123,10 @@ def route_message(
|
|||||||
# Text-based commands (not slash commands — these work in any adapter)
|
# Text-based commands (not slash commands — these work in any adapter)
|
||||||
if text.lower() == "/clear":
|
if text.lower() == "/clear":
|
||||||
default_model = _get_config().get("bot.default_model", "sonnet")
|
default_model = _get_config().get("bot.default_model", "sonnet")
|
||||||
cleared = clear_session(channel_id)
|
cleared_text = clear_session(channel_id)
|
||||||
if cleared:
|
# Also drop the isolated voice session if one exists on this channel.
|
||||||
|
clear_session(f"voice:{channel_id}")
|
||||||
|
if cleared_text:
|
||||||
return f"Session cleared. Model reset to {default_model}.", True
|
return f"Session cleared. Model reset to {default_model}.", True
|
||||||
return "No active session.", True
|
return "No active session.", True
|
||||||
|
|
||||||
@@ -159,12 +161,19 @@ def route_message(
|
|||||||
# (Engineering decision #14 in the plan.) Only the discord-voice adapter
|
# (Engineering decision #14 in the plan.) Only the discord-voice adapter
|
||||||
# triggers it — text adapters keep the message verbatim.
|
# triggers it — text adapters keep the message verbatim.
|
||||||
claude_text = text
|
claude_text = text
|
||||||
if adapter_name == "discord-voice":
|
voice_mode = adapter_name == "discord-voice"
|
||||||
|
if voice_mode:
|
||||||
user_name = _get_config().get("voice.user_name", "user") or "user"
|
user_name = _get_config().get("voice.user_name", "user") or "user"
|
||||||
claude_text = f"[speaker:{user_name}] {text}"
|
claude_text = f"[speaker:{user_name}] {text}"
|
||||||
|
# Voice sessions use an isolated session key so they start fresh with
|
||||||
|
# VOICE_MODE.md and don't pollute the text channel's conversation.
|
||||||
|
session_key = f"voice:{channel_id}" if voice_mode else channel_id
|
||||||
|
|
||||||
try:
|
try:
|
||||||
response = send_message(channel_id, claude_text, model=model, on_text=on_text)
|
response = send_message(
|
||||||
|
session_key, claude_text, model=model, on_text=on_text,
|
||||||
|
voice_mode=voice_mode,
|
||||||
|
)
|
||||||
_set_last_response(channel_id, response)
|
_set_last_response(channel_id, response)
|
||||||
return response, False
|
return response, False
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
|
|||||||
@@ -94,6 +94,55 @@ def expand_numbers_ro(text: str) -> str:
|
|||||||
return _NUM_TOKEN.sub(_sub, text)
|
return _NUM_TOKEN.sub(_sub, text)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------- Thousands separator ----------
|
||||||
|
|
||||||
|
# Romanian uses dot or space as thousands separator: 384.000 / 384 000. The
|
||||||
|
# decimal expander would read "384.000" as "trei sute optzeci și patru virgulă
|
||||||
|
# zero zero zero" — wrong. Collapse the dots so expand_numbers_ro reads the
|
||||||
|
# whole integer. Only 1-3 leading digits followed by ≥1 group of exactly 3
|
||||||
|
# digits, never adjacent to other digits.
|
||||||
|
_THOUSANDS_DOT = re.compile(r'(?<!\d)(\d{1,3}(?:\.\d{3})+)(?!\d)')
|
||||||
|
|
||||||
|
|
||||||
|
def normalize_thousands(text: str) -> str:
|
||||||
|
"""Strip the dot from Romanian thousands-separator integers."""
|
||||||
|
return _THOUSANDS_DOT.sub(lambda m: m.group(1).replace('.', ''), text)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------- Metric units ----------
|
||||||
|
|
||||||
|
# (regex_matching_<n><unit>, singular, plural). Matches an integer or decimal
|
||||||
|
# followed by the abbreviation as a whole word. Skipping bare ``m`` and ``l``
|
||||||
|
# because they collide with too many tokens ("M2" voice id, list markers).
|
||||||
|
_UNIT_PATTERNS: list[tuple[re.Pattern, str, str]] = [
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*km\b', re.IGNORECASE), 'kilometru', 'kilometri'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*kg\b', re.IGNORECASE), 'kilogram', 'kilograme'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*cm\b', re.IGNORECASE), 'centimetru', 'centimetri'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*mm\b', re.IGNORECASE), 'milimetru', 'milimetri'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*ml\b', re.IGNORECASE), 'mililitru', 'mililitri'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*ha\b', re.IGNORECASE), 'hectar', 'hectare'),
|
||||||
|
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*mp\b', re.IGNORECASE), 'metru pătrat', 'metri pătrați'),
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def _format_unit(amount_str: str, singular: str, plural: str) -> str:
|
||||||
|
"""Mirror ``_format_currency_unit`` for metric units. Decimals fall through
|
||||||
|
to the generic decimal expander (which leaves them with plural form)."""
|
||||||
|
if '.' in amount_str or ',' in amount_str:
|
||||||
|
return f"{_decimal_to_ro(amount_str.replace(',', '.'))} {plural}"
|
||||||
|
return _format_currency_unit(int(amount_str), singular, plural)
|
||||||
|
|
||||||
|
|
||||||
|
def expand_units(text: str) -> str:
|
||||||
|
"""Expand metric unit abbreviations into spoken Romanian."""
|
||||||
|
for pattern, singular, plural in _UNIT_PATTERNS:
|
||||||
|
text = pattern.sub(
|
||||||
|
lambda m, sg=singular, pl=plural: _format_unit(m.group(1), sg, pl),
|
||||||
|
text,
|
||||||
|
)
|
||||||
|
return text
|
||||||
|
|
||||||
|
|
||||||
# ---------- Time ----------
|
# ---------- Time ----------
|
||||||
|
|
||||||
_TIME_PATTERN = re.compile(r'(?<!\d)([01]?\d|2[0-3]):([0-5]?\d)(?!\d)')
|
_TIME_PATTERN = re.compile(r'(?<!\d)([01]?\d|2[0-3]):([0-5]?\d)(?!\d)')
|
||||||
@@ -257,8 +306,10 @@ def normalize_for_tts(text: str) -> str:
|
|||||||
text = strip_markdown(text)
|
text = strip_markdown(text)
|
||||||
text = sanitize_punctuation(text)
|
text = sanitize_punctuation(text)
|
||||||
text = expand_abbreviations(text)
|
text = expand_abbreviations(text)
|
||||||
|
text = normalize_thousands(text)
|
||||||
text = expand_time(text)
|
text = expand_time(text)
|
||||||
text = expand_currency(text)
|
text = expand_currency(text)
|
||||||
|
text = expand_units(text)
|
||||||
text = expand_numbers_ro(text)
|
text = expand_numbers_ro(text)
|
||||||
text = expand_symbols(text)
|
text = expand_symbols(text)
|
||||||
words = text.split()
|
words = text.split()
|
||||||
|
|||||||
@@ -34,6 +34,7 @@ from typing import Any, Callable, Optional
|
|||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
from src.voice._discord_voice_adapter import AudioSink, VoiceData
|
from src.voice._discord_voice_adapter import AudioSink, VoiceData
|
||||||
|
from src.voice.voice_commands import detect_voice_change
|
||||||
|
|
||||||
log = logging.getLogger(__name__)
|
log = logging.getLogger(__name__)
|
||||||
|
|
||||||
@@ -274,6 +275,12 @@ class VoiceSession:
|
|||||||
except Exception as e: # noqa: BLE001
|
except Exception as e: # noqa: BLE001
|
||||||
log.warning("ttsq.clear failed: %s", e)
|
log.warning("ttsq.clear failed: %s", e)
|
||||||
|
|
||||||
|
# In-band voice command: change TTS voice without round-tripping Claude.
|
||||||
|
new_voice = detect_voice_change(text)
|
||||||
|
if new_voice is not None:
|
||||||
|
await self._handle_voice_change(speaker_name, text, new_voice)
|
||||||
|
return
|
||||||
|
|
||||||
# 1. Mirror to text channel (one Unicode 🎤 — exception per plan).
|
# 1. Mirror to text channel (one Unicode 🎤 — exception per plan).
|
||||||
if self.mirror_enabled and self.text_channel is not None:
|
if self.mirror_enabled and self.text_channel is not None:
|
||||||
try:
|
try:
|
||||||
@@ -327,6 +334,45 @@ class VoiceSession:
|
|||||||
except Exception as e: # noqa: BLE001
|
except Exception as e: # noqa: BLE001
|
||||||
log.error("route_message voice path failed: %s", e)
|
log.error("route_message voice path failed: %s", e)
|
||||||
|
|
||||||
|
async def _handle_voice_change(
|
||||||
|
self, speaker_name: str, original_text: str, new_voice: str,
|
||||||
|
) -> None:
|
||||||
|
"""Apply an in-band 'change voice' command: swap live, persist to
|
||||||
|
config, mirror to chat, speak a short acknowledgment in the new voice.
|
||||||
|
Does NOT forward the utterance to Claude."""
|
||||||
|
# 1. Live-swap on the TTS queue. Next clause synth uses the new voice.
|
||||||
|
try:
|
||||||
|
self.ttsq.voice_id = new_voice
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
log.warning("ttsq voice swap failed: %s", e)
|
||||||
|
# 2. Persist as the new default for future sessions.
|
||||||
|
try:
|
||||||
|
from src.config import Config
|
||||||
|
cfg = Config()
|
||||||
|
cfg.set("voice.default_voice", new_voice)
|
||||||
|
cfg.save()
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
log.warning("voice default persist failed: %s", e)
|
||||||
|
# 3. Mirror what was heard + show the swap in the text channel.
|
||||||
|
if self.mirror_enabled and self.text_channel is not None:
|
||||||
|
try:
|
||||||
|
send = getattr(self.text_channel, "send", None)
|
||||||
|
if callable(send):
|
||||||
|
coro = send(
|
||||||
|
f"\U0001f3a4 {speaker_name}: \"{original_text}\"\n"
|
||||||
|
f"\U0001f50a Voce → **{new_voice}**"
|
||||||
|
)
|
||||||
|
if asyncio.iscoroutine(coro):
|
||||||
|
await coro
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
log.warning("voice mirror send failed: %s", e)
|
||||||
|
# 4. Verbal acknowledgment in the NEW voice.
|
||||||
|
try:
|
||||||
|
self.ttsq.push_text(f"Vocea {new_voice}.")
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
log.warning("voice ack push failed: %s", e)
|
||||||
|
self._log_metric({"event": "voice_change", "new_voice": new_voice})
|
||||||
|
|
||||||
# ----- helpers -----
|
# ----- helpers -----
|
||||||
|
|
||||||
def _resolve_speaker_name(self, speaker_id: int) -> str:
|
def _resolve_speaker_name(self, speaker_id: int) -> str:
|
||||||
@@ -381,6 +427,10 @@ class EchoVoiceSink(AudioSink):
|
|||||||
# chain breaks when "I spoke but Echo heard nothing" happens.
|
# chain breaks when "I spoke but Echo heard nothing" happens.
|
||||||
self._first_packet_logged: set[int] = set()
|
self._first_packet_logged: set[int] = set()
|
||||||
self._first_speech_logged: set[int] = set()
|
self._first_speech_logged: set[int] = set()
|
||||||
|
# Track consecutive VAD-positive windows per user. Used to delay
|
||||||
|
# barge-in (don't cut Echo off on a single jittery VAD hit; require
|
||||||
|
# ≥2 windows ≈ 200ms of sustained speech).
|
||||||
|
self._vad_consecutive: dict[int, int] = {}
|
||||||
# Background poller that triggers the silence flush even when Discord
|
# Background poller that triggers the silence flush even when Discord
|
||||||
# DTX stops delivering RTP packets after the user stops speaking. Without
|
# DTX stops delivering RTP packets after the user stops speaking. Without
|
||||||
# this, sink.write would stop firing and STT would never run on the
|
# this, sink.write would stop firing and STT would never run on the
|
||||||
@@ -444,9 +494,27 @@ class EchoVoiceSink(AudioSink):
|
|||||||
if uid not in self._first_speech_logged:
|
if uid not in self._first_speech_logged:
|
||||||
self._first_speech_logged.add(uid)
|
self._first_speech_logged.add(uid)
|
||||||
log.info("voice sink: VAD detected speech from user %s", uid)
|
log.info("voice sink: VAD detected speech from user %s", uid)
|
||||||
|
self._vad_consecutive[uid] = self._vad_consecutive.get(uid, 0) + 1
|
||||||
with self._sink_lock:
|
with self._sink_lock:
|
||||||
self._last_speech_ts[uid] = time.monotonic()
|
self._last_speech_ts[uid] = time.monotonic()
|
||||||
self._has_speech[uid] = True
|
self._has_speech[uid] = True
|
||||||
|
# Fast barge-in: after ≥2 consecutive VAD windows (~200ms
|
||||||
|
# of sustained speech), cut Echo's TTS mid-sentence so the
|
||||||
|
# user doesn't have to wait the full silence-flush + STT
|
||||||
|
# cycle (~3s).
|
||||||
|
if self._vad_consecutive[uid] >= 2:
|
||||||
|
try:
|
||||||
|
ttsq = self.session.ttsq
|
||||||
|
if ttsq is not None and not ttsq.is_empty():
|
||||||
|
ttsq.clear()
|
||||||
|
log.info(
|
||||||
|
"voice sink: barge-in cleared TTS queue (user=%s)",
|
||||||
|
uid,
|
||||||
|
)
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
log.warning("barge-in clear failed: %s", e)
|
||||||
|
else:
|
||||||
|
self._vad_consecutive[uid] = 0
|
||||||
|
|
||||||
pcm_for_stt = self._take_flushable_pcm(uid)
|
pcm_for_stt = self._take_flushable_pcm(uid)
|
||||||
if pcm_for_stt:
|
if pcm_for_stt:
|
||||||
@@ -530,7 +598,10 @@ class EchoVoiceSink(AudioSink):
|
|||||||
mono16, language="ro", beam_size=5,
|
mono16, language="ro", beam_size=5,
|
||||||
initial_prompt=(
|
initial_prompt=(
|
||||||
"Echo Core, asistent personal AI românesc al lui Marius. "
|
"Echo Core, asistent personal AI românesc al lui Marius. "
|
||||||
"Conversație colocvială în română."
|
"Conversație colocvială în română. "
|
||||||
|
"Comenzi voce recunoscute: schimbă vocea pe M1, M2, M3, M4, M5, "
|
||||||
|
"F1, F2, F3, F4, F5. Exemple: vorbește cu vocea M5, voce F3, "
|
||||||
|
"treci pe vocea F1."
|
||||||
),
|
),
|
||||||
condition_on_previous_text=False,
|
condition_on_previous_text=False,
|
||||||
)
|
)
|
||||||
|
|||||||
118
src/voice/voice_commands.py
Normal file
118
src/voice/voice_commands.py
Normal file
@@ -0,0 +1,118 @@
|
|||||||
|
"""Detect in-band voice commands from STT transcripts.
|
||||||
|
|
||||||
|
The voice pipeline transcribes Marius's speech via Whisper and dispatches the
|
||||||
|
text to Claude. Some utterances are not questions for Claude — they're
|
||||||
|
control commands for the voice stack itself. This module parses those out
|
||||||
|
*before* the Claude round-trip so they take effect instantly and don't waste
|
||||||
|
a Claude session turn.
|
||||||
|
|
||||||
|
Currently handled:
|
||||||
|
* change TTS voice — "schimbă vocea pe M5", "vorbește cu vocea F3",
|
||||||
|
"voce em cinci", "voce feminină 3", etc.
|
||||||
|
|
||||||
|
The parser is intentionally conservative: it requires BOTH a voice trigger
|
||||||
|
word ("voce", "vorbește", "schimbă", "treci pe") AND a recognizable voice
|
||||||
|
ID. A bare "M5" without context is NOT a command — Marius might be quoting
|
||||||
|
a string.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import re
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
|
||||||
|
_VALID_VOICES = {f"M{i}" for i in range(1, 6)} | {f"F{i}" for i in range(1, 6)}
|
||||||
|
|
||||||
|
|
||||||
|
# Trigger words that suggest the user is talking ABOUT the voice, not just
|
||||||
|
# saying something that happens to contain a voice-ID-looking substring.
|
||||||
|
_VOICE_TRIGGER_RE = re.compile(
|
||||||
|
r'\b(voce|vocea|voci|voice|vorbe[șs]te|schimb[aăÎ]|treci\s+pe)\b',
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Direct form: "M5", "F 3", "m5", etc.
|
||||||
|
_VOICE_ID_DIRECT_RE = re.compile(
|
||||||
|
r'\b([MF])\s*([1-5])\b',
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Word form: "em cinci", "M trei", "masculin doi", "feminină patru", etc.
|
||||||
|
# Whisper often transcribes "M5" as "em cinci" / "M cinci" because letter
|
||||||
|
# names are spelled out phonetically in Romanian.
|
||||||
|
_VOICE_ID_WORDS_RE = re.compile(
|
||||||
|
r'\b(em|m|masculin[aăe]?|ef|f|feminin[aăe]?)\s+(unu|una|doi|dou[ăa]|trei|patru|cinci|[1-5])\b',
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
_DIGIT_WORD_TO_INT = {
|
||||||
|
'unu': 1, 'una': 1, 'unul': 1, '1': 1,
|
||||||
|
'doi': 2, 'două': 2, 'doua': 2, '2': 2,
|
||||||
|
'trei': 3, '3': 3,
|
||||||
|
'patru': 4, '4': 4,
|
||||||
|
'cinci': 5, '5': 5,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Substring fallback: matches digit roots even when Whisper glues them into
|
||||||
|
# compound non-words like "Mâcinci" (for "M cinci"=M5).
|
||||||
|
_DIGIT_SUBSTR_RE = re.compile(
|
||||||
|
r'(cinci|patru|trei|dou[ăa]|unul|unu|una)',
|
||||||
|
re.IGNORECASE,
|
||||||
|
)
|
||||||
|
|
||||||
|
_F_GENDER_HINT_RE = re.compile(r'feminin|\bef\b|\bF\d?\b', re.IGNORECASE)
|
||||||
|
|
||||||
|
|
||||||
|
def _normalize_gender(word: str) -> Optional[str]:
|
||||||
|
"""Map gender word to 'M' or 'F'."""
|
||||||
|
w = word.lower()
|
||||||
|
if w in ('m', 'em') or w.startswith('masculin'):
|
||||||
|
return 'M'
|
||||||
|
if w in ('f', 'ef') or w.startswith('feminin'):
|
||||||
|
return 'F'
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def detect_voice_change(text: str) -> Optional[str]:
|
||||||
|
"""Parse a transcript for a 'change voice' command.
|
||||||
|
|
||||||
|
Returns the target voice id (one of M1-M5, F1-F5) or None if no command
|
||||||
|
was detected. Requires both a voice trigger word and a voice ID.
|
||||||
|
"""
|
||||||
|
if not text:
|
||||||
|
return None
|
||||||
|
if not _VOICE_TRIGGER_RE.search(text):
|
||||||
|
return None
|
||||||
|
# Try the direct form first (M5, F3, etc.)
|
||||||
|
m = _VOICE_ID_DIRECT_RE.search(text)
|
||||||
|
if m:
|
||||||
|
candidate = f"{m.group(1).upper()}{m.group(2)}"
|
||||||
|
if candidate in _VALID_VOICES:
|
||||||
|
return candidate
|
||||||
|
# Fall back to the word form ("em cinci", "feminin trei", ...).
|
||||||
|
m = _VOICE_ID_WORDS_RE.search(text)
|
||||||
|
if m:
|
||||||
|
gender = _normalize_gender(m.group(1))
|
||||||
|
digit = _DIGIT_WORD_TO_INT.get(m.group(2).lower())
|
||||||
|
if gender is not None and digit is not None:
|
||||||
|
candidate = f"{gender}{digit}"
|
||||||
|
if candidate in _VALID_VOICES:
|
||||||
|
return candidate
|
||||||
|
# Permissive fallback: Whisper sometimes glues the letter into the next
|
||||||
|
# word ("Mâcinci" for "M cinci") or replaces it ("unul cinci" for
|
||||||
|
# "M unu cinci"). After a voice trigger word, scan for any digit-word
|
||||||
|
# substring and infer gender (F if a feminine marker is present, else M).
|
||||||
|
digit_hits = _DIGIT_SUBSTR_RE.findall(text)
|
||||||
|
digits = [_DIGIT_WORD_TO_INT[d.lower()] for d in digit_hits
|
||||||
|
if d.lower() in _DIGIT_WORD_TO_INT]
|
||||||
|
digits = [d for d in digits if 1 <= d <= 5]
|
||||||
|
if digits:
|
||||||
|
gender = 'F' if _F_GENDER_HINT_RE.search(text) else 'M'
|
||||||
|
# Last digit wins — handles "M unu cinci" → M5 since "unu" is a
|
||||||
|
# mangled letter-name prefix, "cinci" is the actual target.
|
||||||
|
return f"{gender}{digits[-1]}"
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
__all__ = ["detect_voice_change"]
|
||||||
@@ -30,7 +30,10 @@ class TestClearCommand:
|
|||||||
response, is_cmd = route_message("ch-1", "user-1", "/clear")
|
response, is_cmd = route_message("ch-1", "user-1", "/clear")
|
||||||
assert response == "Session cleared. Model reset to sonnet."
|
assert response == "Session cleared. Model reset to sonnet."
|
||||||
assert is_cmd is True
|
assert is_cmd is True
|
||||||
mock_clear.assert_called_once_with("ch-1")
|
# /clear drops both the text-adapter session and the isolated voice
|
||||||
|
# session for the same Discord channel.
|
||||||
|
mock_clear.assert_any_call("ch-1")
|
||||||
|
mock_clear.assert_any_call("voice:ch-1")
|
||||||
|
|
||||||
@patch("src.router._get_config")
|
@patch("src.router._get_config")
|
||||||
@patch("src.router.clear_session")
|
@patch("src.router.clear_session")
|
||||||
@@ -191,7 +194,7 @@ class TestRegularMessage:
|
|||||||
response, is_cmd = route_message("ch-1", "user-1", "hello")
|
response, is_cmd = route_message("ch-1", "user-1", "hello")
|
||||||
assert response == "Hello from Claude!"
|
assert response == "Hello from Claude!"
|
||||||
assert is_cmd is False
|
assert is_cmd is False
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None, voice_mode=False)
|
||||||
|
|
||||||
@patch("src.router.send_message")
|
@patch("src.router.send_message")
|
||||||
def test_model_override(self, mock_send):
|
def test_model_override(self, mock_send):
|
||||||
@@ -199,7 +202,7 @@ class TestRegularMessage:
|
|||||||
response, is_cmd = route_message("ch-1", "user-1", "hello", model="opus")
|
response, is_cmd = route_message("ch-1", "user-1", "hello", model="opus")
|
||||||
assert response == "Response"
|
assert response == "Response"
|
||||||
assert is_cmd is False
|
assert is_cmd is False
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)
|
||||||
|
|
||||||
@patch("src.router._get_channel_config")
|
@patch("src.router._get_channel_config")
|
||||||
@patch("src.router._get_config")
|
@patch("src.router._get_config")
|
||||||
@@ -227,7 +230,7 @@ class TestRegularMessage:
|
|||||||
|
|
||||||
cb = lambda t: None
|
cb = lambda t: None
|
||||||
route_message("ch-1", "user-1", "hello", on_text=cb)
|
route_message("ch-1", "user-1", "hello", on_text=cb)
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=cb)
|
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=cb, voice_mode=False)
|
||||||
|
|
||||||
|
|
||||||
# --- _get_channel_config ---
|
# --- _get_channel_config ---
|
||||||
@@ -269,7 +272,7 @@ class TestModelResolution:
|
|||||||
mock_chan_cfg.return_value = {"id": "ch-1", "default_model": "haiku"}
|
mock_chan_cfg.return_value = {"id": "ch-1", "default_model": "haiku"}
|
||||||
|
|
||||||
route_message("ch-1", "user-1", "hello")
|
route_message("ch-1", "user-1", "hello")
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="haiku", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="haiku", on_text=None, voice_mode=False)
|
||||||
|
|
||||||
@patch("src.router._get_channel_config")
|
@patch("src.router._get_channel_config")
|
||||||
@patch("src.router._get_config")
|
@patch("src.router._get_config")
|
||||||
@@ -283,7 +286,7 @@ class TestModelResolution:
|
|||||||
mock_get_config.return_value = mock_cfg
|
mock_get_config.return_value = mock_cfg
|
||||||
|
|
||||||
route_message("ch-1", "user-1", "hello")
|
route_message("ch-1", "user-1", "hello")
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)
|
||||||
|
|
||||||
@patch("src.router._get_channel_config")
|
@patch("src.router._get_channel_config")
|
||||||
@patch("src.router._get_config")
|
@patch("src.router._get_config")
|
||||||
@@ -297,7 +300,7 @@ class TestModelResolution:
|
|||||||
mock_get_config.return_value = mock_cfg
|
mock_get_config.return_value = mock_cfg
|
||||||
|
|
||||||
route_message("ch-1", "user-1", "hello")
|
route_message("ch-1", "user-1", "hello")
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None, voice_mode=False)
|
||||||
|
|
||||||
@patch("src.router.get_active_session")
|
@patch("src.router.get_active_session")
|
||||||
@patch("src.router.send_message")
|
@patch("src.router.send_message")
|
||||||
@@ -307,4 +310,4 @@ class TestModelResolution:
|
|||||||
mock_get_session.return_value = {"model": "opus", "session_id": "abc"}
|
mock_get_session.return_value = {"model": "opus", "session_id": "abc"}
|
||||||
|
|
||||||
route_message("ch-1", "user-1", "hello")
|
route_message("ch-1", "user-1", "hello")
|
||||||
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
|
mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)
|
||||||
|
|||||||
55
tests/test_voice_commands.py
Normal file
55
tests/test_voice_commands.py
Normal file
@@ -0,0 +1,55 @@
|
|||||||
|
"""Tests for src/voice/voice_commands.detect_voice_change."""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from src.voice.voice_commands import detect_voice_change
|
||||||
|
|
||||||
|
|
||||||
|
class TestDetectVoiceChange:
|
||||||
|
# --- positive cases (direct form) ---
|
||||||
|
@pytest.mark.parametrize("text,expected", [
|
||||||
|
("schimbă vocea pe M5", "M5"),
|
||||||
|
("Schimbă vocea pe F3.", "F3"),
|
||||||
|
("vorbește cu vocea M1", "M1"),
|
||||||
|
("vorbește cu vocea F2", "F2"),
|
||||||
|
("voce M4", "M4"),
|
||||||
|
("Voce F5.", "F5"),
|
||||||
|
("treci pe vocea F1", "F1"),
|
||||||
|
("Echo, treci pe M2.", "M2"),
|
||||||
|
("voice M3", "M3"),
|
||||||
|
])
|
||||||
|
def test_direct_form(self, text, expected):
|
||||||
|
assert detect_voice_change(text) == expected
|
||||||
|
|
||||||
|
# --- positive cases (word form, what Whisper actually produces) ---
|
||||||
|
@pytest.mark.parametrize("text,expected", [
|
||||||
|
("schimbă vocea pe em cinci", "M5"),
|
||||||
|
("vorbește cu vocea em trei", "M3"),
|
||||||
|
("voce em unu", "M1"),
|
||||||
|
("schimbă vocea pe ef doi", "F2"),
|
||||||
|
("voce ef cinci", "F5"),
|
||||||
|
("vorbește cu vocea masculină cinci", "M5"),
|
||||||
|
("schimbă vocea pe feminină trei", "F3"),
|
||||||
|
("voce masculin patru", "M4"),
|
||||||
|
("schimbă vocea pe M cinci", "M5"),
|
||||||
|
("voce F două", "F2"),
|
||||||
|
])
|
||||||
|
def test_word_form(self, text, expected):
|
||||||
|
assert detect_voice_change(text) == expected
|
||||||
|
|
||||||
|
# --- negative cases ---
|
||||||
|
@pytest.mark.parametrize("text", [
|
||||||
|
"",
|
||||||
|
"cât este ora",
|
||||||
|
"M5", # no trigger word
|
||||||
|
"Salut Echo, sunt în M3", # M3 here is a location/etc, no trigger
|
||||||
|
"vocea ta este foarte bună", # trigger but no voice id
|
||||||
|
"schimbă te rog", # trigger but no id
|
||||||
|
"voce M6", # out of range
|
||||||
|
"voce M0", # out of range
|
||||||
|
"voce F8", # out of range
|
||||||
|
"schimbă vocea pe șapte", # digit out of range
|
||||||
|
])
|
||||||
|
def test_no_match(self, text):
|
||||||
|
assert detect_voice_change(text) is None
|
||||||
Reference in New Issue
Block a user