feat(voice): voice-mode prompt, isolated session, units, verbal voice swap, fast barge-in

Second voice UX iteration. Targets Marius's live-test pain points from today.

- **Voice-mode system prompt** (personality/VOICE_MODE.md, plumbed via
  claude_session.build_system_prompt(voice_mode=True)) — when the voice
  adapter starts a session, append voice-tailored instructions: short replies,
  no markdown, no abbreviations, time without seconds, distances rounded
  to "mii"/"milioane", no curly quotes / em-dash / ellipsis. Marius asked
  for an "in-the-car friend" persona for voice.

- **Isolated voice session key** (router.py) — voice mode uses
  `voice:<channel_id>` so it doesn't share context with the text adapter
  on the same Discord channel. Fresh start, voice prompt applied
  automatically without `/clear` ceremony. `/clear` drops both keys.

- **Metric units + Romanian thousands** (src/voice/normalize.py) —
  `384.000 km` was being read as "trei sute optzeci și patru virgulă zero
  zero zero km" because the dot was treated as decimal separator and `km`
  wasn't expanded. New `normalize_thousands` collapses Romanian thousands
  separators (`X.000`/`X.000.000`) before number expansion, and
  `expand_units` handles km/kg/cm/mm/ml/ha/mp with correct Romanian
  pluralization ("un kilometru", "două kilograme", "douăzeci de
  centimetri", "o sută de kilometri" with "de" particle).

- **`/voice setvoice <M1-M5|F1-F5>` slash command** (discord_voice.py):
  Discord-native choice list (not free-text autocomplete); swaps the live
  TTSQueue voice_id AND persists voice.default_voice to config.json. No
  restart needed.

- **Verbal voice change** (src/voice/voice_commands.py — new module +
  29 tests) — say "schimbă vocea pe M5" / "vorbește cu vocea F3" / "voce
  em cinci" from inside the voice channel. Detector requires both a
  trigger word (voce/vorbește/schimbă/treci pe) and a recognizable voice
  ID (direct "M5", word form "em cinci", or fallback substring match for
  Whisper-mangled forms like "unul cinci"=M5 and "Mâcinci"=M5). On
  detection: live-swap, persist to config, mirror to chat with
  `🎤 ... / 🔊 Voce → M5`, speak short ack in the NEW voice, skip
  Claude. "pământinci" still can't be recovered (no recoverable digit
  substring); user gets passthrough to Claude in that case.

- **Whisper initial_prompt** now lists the voice-command vocabulary so
  STT biases toward producing clean "M5" / "F3" tokens instead of
  inventing "pământ" / "unul" phonetic neighbors.

- **Fast barge-in** (pipeline.py EchoVoiceSink) — previously `ttsq.clear()`
  only fired in `on_segment_done` (after 800ms silence + 2-3s STT ≈ 3s lag).
  Now also fires from the sink as soon as VAD detects ≥2 consecutive
  windows (~200ms) of sustained speech from Marius while Echo has
  pending TTS frames. Single-window glitches don't cut Echo off; sustained
  speech does. (Acoustic echo bleed-through still requires headphones —
  no AEC in the bot.)

- Tests: 130 voice + router tests pass; updated test_router.py to expect
  `/clear` to drop both text and voice session keys.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:59:10 +00:00
parent d1bc77e87d
commit e589e4885e
10 changed files with 412 additions and 20 deletions


@@ -105,9 +105,11 @@
"url": "http://10.0.20.161:11434"
},
"voice": {
-"allowed_user_ids": ["949388626146517022"],
+"allowed_user_ids": [
+"949388626146517022"
+],
"user_name": "Marius",
-"default_voice": "M2",
+"default_voice": "M5",
"auto_leave_minutes": 5
},
"paths": {

personality/VOICE_MODE.md (new file, 30 lines)

@@ -0,0 +1,30 @@
# Voice Mode
Răspunzi prin voce (TTS). Marius te aude — nu citește. Reguli care contează:
## Lungime și ton
- **Scurt**: 1-2 propoziții, max ~30 cuvinte per turn. Marius vorbește cu tine — nu redactezi un document.
- **Conversațional**: ca un om viu. Fără "Sigur, iată...", "Permite-mi să...", "Te rog să...". Direct la subiect.
- **Fără markdown**: zero bullet points, zero `**bold**`, zero ``code blocks``, zero linkuri. Totul e citit cu voce.
## Numere și unități
- **Ora**: fără secunde. Spune "ora 23 și 9 minute" sau "9 și jumătate", nu "23:09:42".
- **Distanțe mari**: rotunjește în "mii" sau "milioane". Pentru Pământ-Lună spune "384 mii de kilometri", nu "384.000 km".
- **Zecimale**: omite-le când nu adaugă informație. "5 lei" nu "5,00 lei". "două ore" nu "2,0 ore". "20 de minute" nu "20,5 minute".
- **Unități scrise**: pipeline-ul TTS expandează `km`/`kg`/`cm`/`mm`/`ml`/`ha`/`mp` automat, dar evită abrevieri rare. Scrie "metri" nu "m." dacă e ambiguu.
## Structură
- Listă scurtă verbală: "Trei lucruri: întâi X, apoi Y, plus Z."
- Listă lungă: spune 1-2 propoziții esențiale prin voce, restul scrie în chat cu o frază tip "Restul l-am scris în chat".
- Întrebări clarificatoare: pune UNA, nu trei.
## Punctuație
- Doar virgule și puncte. Fără `„` `"` `—` `…` `«»`: pipeline-ul oricum le sanitizează, dar evită-le ca să eviți pauzele forțate.
## Ești prietenul lui Marius din mașină
Imaginează-ți că Marius conduce și te-a întrebat ceva pe difuzor. Răspunzi natural, scurt, la subiect — fără ceremonii.


@@ -246,6 +246,45 @@ def register(tree: app_commands.CommandTree, bot: discord.Client) -> app_command
log.warning("Presence reset skipped", exc_info=True)
await interaction.followup.send("Plecat.", ephemeral=True)
_VOICE_CHOICES = [
app_commands.Choice(name=v, value=v)
for v in ("M1", "M2", "M3", "M4", "M5", "F1", "F2", "F3", "F4", "F5")
]
@voice_group.command(name="setvoice", description="Schimbă vocea Echo (M1-M5 sau F1-F5)")
@app_commands.describe(voice="Voce nouă")
@app_commands.choices(voice=_VOICE_CHOICES)
async def setvoice(
interaction: discord.Interaction,
voice: app_commands.Choice[str],
) -> None:
await interaction.response.defer(ephemeral=True)
new_voice = voice.value
# Live-swap on the active session if Echo is in voice on this guild.
guild_id = interaction.guild.id if interaction.guild else None
session = _voice_sessions.get(guild_id) if guild_id is not None else None
live_swapped = False
if session is not None and session.ttsq is not None:
session.ttsq.voice_id = new_voice
live_swapped = True
# Persist as the new default for future sessions.
try:
cfg = Config()
cfg.set("voice.default_voice", new_voice)
cfg.save()
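except Exception as e:
log.warning("config save failed for new default voice: %s", e)
# Only claim a live swap when one actually happened.
prefix = "Voce schimbată live" if live_swapped else "Nicio sesiune activă"
await interaction.followup.send(
f"{prefix} ({new_voice}), dar config-ul nu s-a salvat: {e}",
ephemeral=True,
)
return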
if live_swapped:
msg = f"Vocea schimbată **live** pe {new_voice}. Următoarea frază va folosi vocea nouă."
else:
msg = f"Default voce setată {new_voice}. Va intra în vigoare la următorul /voice join."
await interaction.followup.send(msg, ephemeral=True)
@voice_group.command(name="doctor", description="Verifică voice stack")
async def doctor(interaction: discord.Interaction) -> None:
await interaction.response.defer(ephemeral=True)
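
The persistence step in `setvoice` relies on `Config.set` accepting a dotted key such as `voice.default_voice`. The project's `Config` class is not part of this diff, so the following is only a minimal sketch of how such a setter could work on a nested dict. `set_dotted` is a hypothetical helper, not the project's API.

```python
from typing import Any


def set_dotted(cfg: dict[str, Any], dotted_key: str, value: Any) -> None:
    """Hypothetical helper: set cfg["a"]["b"] = value for dotted_key "a.b"."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for name in parents:
        # Create intermediate sections on demand so an empty config also works.
        node = node.setdefault(name, {})
    node[leaf] = value


config = {"voice": {"user_name": "Marius", "default_voice": "M2"}}
set_dotted(config, "voice.default_voice", "M5")
```

Sibling keys under `voice` are left untouched, which matches the config.json hunk where only `default_voice` changes value.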


@@ -399,15 +399,23 @@ def _run_claude(
# ---------------------------------------------------------------------------
-def build_system_prompt() -> str:
-"""Concatenate personality/*.md files into a single system prompt."""
+def build_system_prompt(voice_mode: bool = False) -> str:
+"""Concatenate personality/*.md files into a single system prompt.
+When ``voice_mode=True``, appends ``VOICE_MODE.md`` so the model knows
+its reply will be read aloud (terse, no markdown, no abbreviations, etc.).
+"""
if not PERSONALITY_DIR.is_dir():
raise FileNotFoundError(
f"Personality directory not found: {PERSONALITY_DIR}"
)
+files = list(PERSONALITY_FILES)
+if voice_mode:
+files.append("VOICE_MODE.md")
parts: list[str] = []
-for filename in PERSONALITY_FILES:
+for filename in files:
filepath = PERSONALITY_DIR / filename
if filepath.is_file():
parts.append(filepath.read_text(encoding="utf-8"))
@@ -434,6 +442,7 @@ def start_session(
model: str = DEFAULT_MODEL,
timeout: int = DEFAULT_TIMEOUT,
on_text: Callable[[str], None] | None = None,
+voice_mode: bool = False,
) -> tuple[str, str]:
"""Start a new Claude CLI session for a channel.
@@ -441,13 +450,16 @@ def start_session(
If *on_text* is provided, each intermediate Claude text block is passed
to the callback as soon as it arrives.
+*voice_mode*: when True, ``VOICE_MODE.md`` is appended to the system
+prompt so the model produces short, TTS-friendly responses.
"""
if model not in VALID_MODELS:
raise ValueError(
f"Invalid model '{model}'. Must be one of: haiku, sonnet, opus"
)
-system_prompt = build_system_prompt()
+system_prompt = build_system_prompt(voice_mode=voice_mode)
# Wrap external user message with injection protection markers
wrapped_message = f"[EXTERNAL CONTENT]\n{message}\n[END EXTERNAL CONTENT]"
@@ -578,6 +590,7 @@ def send_message(
model: str = DEFAULT_MODEL,
timeout: int = DEFAULT_TIMEOUT,
on_text: Callable[[str], None] | None = None,
+voice_mode: bool = False,
) -> str:
"""High-level convenience: auto start or resume based on channel state.
@@ -598,7 +611,8 @@ def send_message(
if session is not None and session.get("model"):
effective_model = session["model"]
response_text, _session_id = start_session(
-channel_id, message, effective_model, timeout, on_text=on_text
+channel_id, message, effective_model, timeout,
+on_text=on_text, voice_mode=voice_mode,
)
return response_text
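
The prompt assembly changed above can be tried in isolation. This is a minimal sketch, assuming a one-file `PERSONALITY_FILES` list and a blank-line separator; neither the real file list nor the real join is visible in this hunk, and the directory is passed as a parameter here only to make the example self-contained.

```python
import tempfile
from pathlib import Path

# Assumed file list; the real PERSONALITY_FILES is not shown in this diff.
PERSONALITY_FILES = ["IDENTITY.md"]


def build_system_prompt(personality_dir: Path, voice_mode: bool = False) -> str:
    files = list(PERSONALITY_FILES)  # copy, so the module constant is not mutated
    if voice_mode:
        files.append("VOICE_MODE.md")
    parts = []
    for name in files:
        path = personality_dir / name
        if path.is_file():  # missing files are skipped, as in the diff
            parts.append(path.read_text(encoding="utf-8"))
    # The separator is an assumption; the real join is outside the hunk.
    return "\n\n".join(parts)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "IDENTITY.md").write_text("You are Echo.", encoding="utf-8")
    (root / "VOICE_MODE.md").write_text("Reply briefly.", encoding="utf-8")
    text_prompt = build_system_prompt(root)
    voice_prompt = build_system_prompt(root, voice_mode=True)
```

Because missing files are skipped silently, a deployment without `VOICE_MODE.md` would appear to fall back to the text prompt with no warning; logging that case may be worth considering.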


@@ -123,8 +123,10 @@ def route_message(
# Text-based commands (not slash commands — these work in any adapter)
if text.lower() == "/clear":
default_model = _get_config().get("bot.default_model", "sonnet")
-cleared = clear_session(channel_id)
-if cleared:
+cleared_text = clear_session(channel_id)
+# Also drop the isolated voice session if one exists on this channel.
+clear_session(f"voice:{channel_id}")
+if cleared_text:
return f"Session cleared. Model reset to {default_model}.", True
return "No active session.", True
@@ -159,12 +161,19 @@ def route_message(
# (Engineering decision #14 in the plan.) Only the discord-voice adapter
# triggers it — text adapters keep the message verbatim.
claude_text = text
-if adapter_name == "discord-voice":
+voice_mode = adapter_name == "discord-voice"
+if voice_mode:
user_name = _get_config().get("voice.user_name", "user") or "user"
claude_text = f"[speaker:{user_name}] {text}"
+# Voice sessions use an isolated session key so they start fresh with
+# VOICE_MODE.md and don't pollute the text channel's conversation.
+session_key = f"voice:{channel_id}" if voice_mode else channel_id
try:
-response = send_message(channel_id, claude_text, model=model, on_text=on_text)
+response = send_message(
+session_key, claude_text, model=model, on_text=on_text,
+voice_mode=voice_mode,
+)
_set_last_response(channel_id, response)
return response, False
except Exception as e:
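
The session-key isolation in this hunk can be illustrated with an in-memory store. This is a sketch only: the `sessions` dict and the `session_key` and `clear_channel` helpers are hypothetical stand-ins for the real session storage, which is not shown in the diff.

```python
# Stand-in for the real session store (an assumption for this example).
sessions: dict[str, list[str]] = {}


def session_key(adapter_name: str, channel_id: str) -> str:
    """Voice traffic gets its own key so it never shares context with text."""
    return f"voice:{channel_id}" if adapter_name == "discord-voice" else channel_id


def clear_session(key: str) -> bool:
    """Drop one session; report whether anything was removed."""
    return sessions.pop(key, None) is not None


def clear_channel(channel_id: str) -> bool:
    """Drop both the text and the voice session for a channel."""
    cleared_text = clear_session(channel_id)
    cleared_voice = clear_session(f"voice:{channel_id}")
    return cleared_text or cleared_voice


sessions[session_key("discord", "ch-1")] = ["text history"]
sessions[session_key("discord-voice", "ch-1")] = ["voice history"]
```

One difference from the diff: this sketch reports success when either session existed, whereas the `/clear` handler above bases its reply on the text session only, so a channel with only a voice session would still answer "No active session."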


@@ -94,6 +94,55 @@ def expand_numbers_ro(text: str) -> str:
return _NUM_TOKEN.sub(_sub, text)
# ---------- Thousands separator ----------
# Romanian uses dot or space as thousands separator: 384.000 / 384 000. Only
# the dot form is handled here. The decimal expander would read "384.000" as
# "trei sute optzeci și patru virgulă zero zero zero", which is wrong. Collapse
# the dots so expand_numbers_ro reads the whole integer. Matches 1-3 leading
# digits followed by ≥1 group of exactly 3 digits, never adjacent to other digits.
_THOUSANDS_DOT = re.compile(r'(?<!\d)(\d{1,3}(?:\.\d{3})+)(?!\d)')
def normalize_thousands(text: str) -> str:
"""Strip the dot from Romanian thousands-separator integers."""
return _THOUSANDS_DOT.sub(lambda m: m.group(1).replace('.', ''), text)
# ---------- Metric units ----------
# (regex_matching_<n><unit>, singular, plural). Matches an integer or decimal
# followed by the abbreviation as a whole word. Skipping bare ``m`` and ``l``
# because they collide with too many tokens ("M2" voice id, list markers).
_UNIT_PATTERNS: list[tuple[re.Pattern, str, str]] = [
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*km\b', re.IGNORECASE), 'kilometru', 'kilometri'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*kg\b', re.IGNORECASE), 'kilogram', 'kilograme'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*cm\b', re.IGNORECASE), 'centimetru', 'centimetri'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*mm\b', re.IGNORECASE), 'milimetru', 'milimetri'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*ml\b', re.IGNORECASE), 'mililitru', 'mililitri'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*ha\b', re.IGNORECASE), 'hectar', 'hectare'),
(re.compile(r'(?<!\w)(\d+(?:[.,]\d+)?)\s*mp\b', re.IGNORECASE), 'metru pătrat', 'metri pătrați'),
]
def _format_unit(amount_str: str, singular: str, plural: str) -> str:
"""Mirror ``_format_currency_unit`` for metric units. Decimal amounts are
spelled out with ``_decimal_to_ro`` and always take the plural form."""
if '.' in amount_str or ',' in amount_str:
return f"{_decimal_to_ro(amount_str.replace(',', '.'))} {plural}"
return _format_currency_unit(int(amount_str), singular, plural)
def expand_units(text: str) -> str:
"""Expand metric unit abbreviations into spoken Romanian."""
for pattern, singular, plural in _UNIT_PATTERNS:
text = pattern.sub(
lambda m, sg=singular, pl=plural: _format_unit(m.group(1), sg, pl),
text,
)
return text
# ---------- Time ----------
_TIME_PATTERN = re.compile(r'(?<!\d)([01]?\d|2[0-3]):([0-5]?\d)(?!\d)')
@@ -257,8 +306,10 @@ def normalize_for_tts(text: str) -> str:
text = strip_markdown(text)
text = sanitize_punctuation(text)
text = expand_abbreviations(text)
text = normalize_thousands(text)
text = expand_time(text)
text = expand_currency(text)
text = expand_units(text)
text = expand_numbers_ro(text)
text = expand_symbols(text)
words = text.split()
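
The thousands regex and the "de" particle rule can be checked in isolation. The regex below is copied from the hunk; `needs_de` and `unit_phrase` are hypothetical helpers that sketch the pluralization rule the commit message describes, under the assumption that `_format_currency_unit` (not shown) follows standard Romanian usage: "de" for numbers from 20 up, except when the last two digits are 01 to 19. Numbers are kept as digits here; spelling them out is left to the number expander.

```python
import re

_THOUSANDS_DOT = re.compile(r'(?<!\d)(\d{1,3}(?:\.\d{3})+)(?!\d)')


def normalize_thousands(text: str) -> str:
    """Strip the dot from Romanian thousands-separator integers."""
    return _THOUSANDS_DOT.sub(lambda m: m.group(1).replace('.', ''), text)


def needs_de(n: int) -> bool:
    """Assumed rule: 20+ takes "de" unless the last two digits are 01-19."""
    if n < 20:
        return False
    tail = n % 100
    return tail == 0 or tail >= 20


def unit_phrase(n: int, singular: str, plural: str) -> str:
    """Pick singular/plural and insert the "de" particle where needed."""
    if n == 1:
        return f"1 {singular}"
    return f"{n} de {plural}" if needs_de(n) else f"{n} {plural}"


collapsed = normalize_thousands("384.000 km")
twenty_cm = unit_phrase(20, "centimetru", "centimetri")
two_kg = unit_phrase(2, "kilogram", "kilograme")
```

One edge case to be aware of: the pattern also fires inside dotted strings whose last group has three digits, so an IP address like `10.0.20.161` comes out as `10.0.20161`. If such text can reach the TTS path, a stricter lookaround may be needed.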


@@ -34,6 +34,7 @@ from typing import Any, Callable, Optional
import numpy as np
from src.voice._discord_voice_adapter import AudioSink, VoiceData
from src.voice.voice_commands import detect_voice_change
log = logging.getLogger(__name__)
@@ -274,6 +275,12 @@ class VoiceSession:
except Exception as e: # noqa: BLE001
log.warning("ttsq.clear failed: %s", e)
# In-band voice command: change TTS voice without round-tripping Claude.
new_voice = detect_voice_change(text)
if new_voice is not None:
await self._handle_voice_change(speaker_name, text, new_voice)
return
# 1. Mirror to text channel (one Unicode 🎤 — exception per plan).
if self.mirror_enabled and self.text_channel is not None:
try:
@@ -327,6 +334,45 @@ class VoiceSession:
except Exception as e: # noqa: BLE001
log.error("route_message voice path failed: %s", e)
async def _handle_voice_change(
self, speaker_name: str, original_text: str, new_voice: str,
) -> None:
"""Apply an in-band 'change voice' command: swap live, persist to
config, mirror to chat, speak a short acknowledgment in the new voice.
Does NOT forward the utterance to Claude."""
# 1. Live-swap on the TTS queue. Next clause synth uses the new voice.
try:
self.ttsq.voice_id = new_voice
except Exception as e: # noqa: BLE001
log.warning("ttsq voice swap failed: %s", e)
# 2. Persist as the new default for future sessions.
try:
from src.config import Config
cfg = Config()
cfg.set("voice.default_voice", new_voice)
cfg.save()
except Exception as e: # noqa: BLE001
log.warning("voice default persist failed: %s", e)
# 3. Mirror what was heard + show the swap in the text channel.
if self.mirror_enabled and self.text_channel is not None:
try:
send = getattr(self.text_channel, "send", None)
if callable(send):
coro = send(
f"\U0001f3a4 {speaker_name}: \"{original_text}\"\n"
f"\U0001f50a Voce → **{new_voice}**"
)
if asyncio.iscoroutine(coro):
await coro
except Exception as e: # noqa: BLE001
log.warning("voice mirror send failed: %s", e)
# 4. Verbal acknowledgment in the NEW voice.
try:
self.ttsq.push_text(f"Vocea {new_voice}.")
except Exception as e: # noqa: BLE001
log.warning("voice ack push failed: %s", e)
self._log_metric({"event": "voice_change", "new_voice": new_voice})
# ----- helpers -----
def _resolve_speaker_name(self, speaker_id: int) -> str:
@@ -381,6 +427,10 @@ class EchoVoiceSink(AudioSink):
# chain breaks when "I spoke but Echo heard nothing" happens.
self._first_packet_logged: set[int] = set()
self._first_speech_logged: set[int] = set()
# Track consecutive VAD-positive windows per user. Used to delay
# barge-in (don't cut Echo off on a single jittery VAD hit; require
# ≥2 windows ≈ 200ms of sustained speech).
self._vad_consecutive: dict[int, int] = {}
# Background poller that triggers the silence flush even when Discord
# DTX stops delivering RTP packets after the user stops speaking. Without
# this, sink.write would stop firing and STT would never run on the
@@ -444,9 +494,27 @@ class EchoVoiceSink(AudioSink):
if uid not in self._first_speech_logged:
self._first_speech_logged.add(uid)
log.info("voice sink: VAD detected speech from user %s", uid)
self._vad_consecutive[uid] = self._vad_consecutive.get(uid, 0) + 1
with self._sink_lock:
self._last_speech_ts[uid] = time.monotonic()
self._has_speech[uid] = True
# Fast barge-in: after ≥2 consecutive VAD windows (~200ms
# of sustained speech), cut Echo's TTS mid-sentence so the
# user doesn't have to wait the full silence-flush + STT
# cycle (~3s).
if self._vad_consecutive[uid] >= 2:
try:
ttsq = self.session.ttsq
if ttsq is not None and not ttsq.is_empty():
ttsq.clear()
log.info(
"voice sink: barge-in cleared TTS queue (user=%s)",
uid,
)
except Exception as e: # noqa: BLE001
log.warning("barge-in clear failed: %s", e)
else:
self._vad_consecutive[uid] = 0
pcm_for_stt = self._take_flushable_pcm(uid)
if pcm_for_stt:
@@ -530,7 +598,10 @@ class EchoVoiceSink(AudioSink):
mono16, language="ro", beam_size=5,
initial_prompt=(
"Echo Core, asistent personal AI românesc al lui Marius. "
-"Conversație colocvială în română."
+"Conversație colocvială în română. "
+"Comenzi voce recunoscute: schimbă vocea pe M1, M2, M3, M4, M5, "
+"F1, F2, F3, F4, F5. Exemple: vorbește cu vocea M5, voce F3, "
+"treci pe vocea F1."
),
condition_on_previous_text=False,
)
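
The barge-in debounce added to `EchoVoiceSink` boils down to a per-user counter of consecutive VAD-positive windows. The following is a standalone sketch of that idea; `BargeInGate` and its `observe` method are hypothetical names, and the real sink additionally holds a lock and calls `ttsq.clear()` itself.

```python
class BargeInGate:
    """Decide when sustained speech should interrupt pending TTS."""

    def __init__(self, required_windows: int = 2) -> None:
        self.required_windows = required_windows
        self._consecutive: dict[int, int] = {}

    def observe(self, uid: int, is_speech: bool, tts_pending: bool) -> bool:
        """Feed one VAD window; return True when TTS should be cut."""
        if not is_speech:
            # Any non-speech window resets the streak for this user.
            self._consecutive[uid] = 0
            return False
        self._consecutive[uid] = self._consecutive.get(uid, 0) + 1
        return self._consecutive[uid] >= self.required_windows and tts_pending


gate = BargeInGate()
decisions = [
    gate.observe(1, True, True),   # first speech window: wait
    gate.observe(1, False, True),  # glitch ended: counter resets
    gate.observe(1, True, True),   # speech again: still only one window
    gate.observe(1, True, True),   # second consecutive window: cut TTS
]
```

With roughly 100 ms per window, as the commit message implies, the default of two windows gives the ~200 ms delay before Echo is interrupted.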

src/voice/voice_commands.py (new file, 118 lines)

@@ -0,0 +1,118 @@
"""Detect in-band voice commands from STT transcripts.
The voice pipeline transcribes Marius's speech via Whisper and dispatches the
text to Claude. Some utterances are not questions for Claude — they're
control commands for the voice stack itself. This module parses those out
*before* the Claude round-trip so they take effect instantly and don't waste
a Claude session turn.
Currently handled:
* change TTS voice — "schimbă vocea pe M5", "vorbește cu vocea F3",
"voce em cinci", "voce feminină 3", etc.
The parser is intentionally conservative: it requires BOTH a voice trigger
word ("voce", "vorbește", "schimbă", "treci pe") AND a recognizable voice
ID. A bare "M5" without context is NOT a command — Marius might be quoting
a string.
"""
from __future__ import annotations
import re
from typing import Optional
_VALID_VOICES = {f"M{i}" for i in range(1, 6)} | {f"F{i}" for i in range(1, 6)}
# Trigger words that suggest the user is talking ABOUT the voice, not just
# saying something that happens to contain a voice-ID-looking substring.
_VOICE_TRIGGER_RE = re.compile(
r'\b(voce|vocea|voci|voice|vorbe[șs]te|schimb[aă]|treci\s+pe)\b',
re.IGNORECASE,
)
# Direct form: "M5", "F 3", "m5", etc.
_VOICE_ID_DIRECT_RE = re.compile(
r'\b([MF])\s*([1-5])\b',
re.IGNORECASE,
)
# Word form: "em cinci", "M trei", "masculin doi", "feminină patru", etc.
# Whisper often transcribes "M5" as "em cinci" / "M cinci" because letter
# names are spelled out phonetically in Romanian.
_VOICE_ID_WORDS_RE = re.compile(
r'\b(em|m|masculin[aăe]?|ef|f|feminin[aăe]?)\s+(unu|una|doi|dou[ăa]|trei|patru|cinci|[1-5])\b',
re.IGNORECASE,
)
_DIGIT_WORD_TO_INT = {
'unu': 1, 'una': 1, 'unul': 1, '1': 1,
'doi': 2, 'două': 2, 'doua': 2, '2': 2,
'trei': 3, '3': 3,
'patru': 4, '4': 4,
'cinci': 5, '5': 5,
}
# Substring fallback: matches digit roots even when Whisper glues them into
# compound non-words like "Mâcinci" (for "M cinci"=M5).
_DIGIT_SUBSTR_RE = re.compile(
r'(cinci|patru|trei|dou[ăa]|unul|unu|una)',
re.IGNORECASE,
)
_F_GENDER_HINT_RE = re.compile(r'feminin|\bef\b|\bF\d?\b', re.IGNORECASE)
def _normalize_gender(word: str) -> Optional[str]:
"""Map gender word to 'M' or 'F'."""
w = word.lower()
if w in ('m', 'em') or w.startswith('masculin'):
return 'M'
if w in ('f', 'ef') or w.startswith('feminin'):
return 'F'
return None
def detect_voice_change(text: str) -> Optional[str]:
"""Parse a transcript for a 'change voice' command.
Returns the target voice id (one of M1-M5, F1-F5) or None if no command
was detected. Requires both a voice trigger word and a voice ID.
"""
if not text:
return None
if not _VOICE_TRIGGER_RE.search(text):
return None
# Try the direct form first (M5, F3, etc.)
m = _VOICE_ID_DIRECT_RE.search(text)
if m:
candidate = f"{m.group(1).upper()}{m.group(2)}"
if candidate in _VALID_VOICES:
return candidate
# Fall back to the word form ("em cinci", "feminin trei", ...).
m = _VOICE_ID_WORDS_RE.search(text)
if m:
gender = _normalize_gender(m.group(1))
digit = _DIGIT_WORD_TO_INT.get(m.group(2).lower())
if gender is not None and digit is not None:
candidate = f"{gender}{digit}"
if candidate in _VALID_VOICES:
return candidate
# Permissive fallback: Whisper sometimes glues the letter into the next
# word ("Mâcinci" for "M cinci") or replaces it ("unul cinci" for
# "M unu cinci"). After a voice trigger word, scan for any digit-word
# substring and infer gender (F if a feminine marker is present, else M).
digit_hits = _DIGIT_SUBSTR_RE.findall(text)
digits = [_DIGIT_WORD_TO_INT[d.lower()] for d in digit_hits
if d.lower() in _DIGIT_WORD_TO_INT]
digits = [d for d in digits if 1 <= d <= 5]
if digits:
gender = 'F' if _F_GENDER_HINT_RE.search(text) else 'M'
# Last digit wins — handles "M unu cinci" → M5 since "unu" is a
# mangled letter-name prefix, "cinci" is the actual target.
return f"{gender}{digits[-1]}"
return None
__all__ = ["detect_voice_change"]
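
To experiment with the three detection stages outside the bot, here is a condensed, self-contained copy of the detector. The regexes follow the module above; the short names (`TRIGGER`, `DIRECT`, `WORDS`, `detect`) are chosen for this example, and the range check against the valid-voice set is dropped because the patterns already restrict digits to 1-5.

```python
import re
from typing import Optional

TRIGGER = re.compile(
    r'\b(voce|vocea|voci|voice|vorbe[șs]te|schimb[aă]|treci\s+pe)\b', re.I)
DIRECT = re.compile(r'\b([MF])\s*([1-5])\b', re.I)
WORDS = re.compile(
    r'\b(em|m|masculin[aăe]?|ef|f|feminin[aăe]?)\s+'
    r'(unu|una|doi|dou[ăa]|trei|patru|cinci|[1-5])\b', re.I)
DIGIT_SUBSTR = re.compile(r'(cinci|patru|trei|dou[ăa]|unul|unu|una)', re.I)
F_HINT = re.compile(r'feminin|\bef\b|\bF\d?\b', re.I)
DIGITS = {
    'unu': 1, 'una': 1, 'unul': 1, '1': 1,
    'doi': 2, 'două': 2, 'doua': 2, '2': 2,
    'trei': 3, '3': 3, 'patru': 4, '4': 4, 'cinci': 5, '5': 5,
}


def detect(text: str) -> Optional[str]:
    if not text or not TRIGGER.search(text):
        return None  # stage 0: a trigger word is mandatory
    m = DIRECT.search(text)  # stage 1: "M5", "F 3"
    if m:
        return f"{m.group(1).upper()}{m.group(2)}"
    m = WORDS.search(text)  # stage 2: "em cinci", "feminină trei"
    if m:
        g = m.group(1).lower()
        gender = 'F' if g in ('f', 'ef') or g.startswith('feminin') else 'M'
        return f"{gender}{DIGITS[m.group(2).lower()]}"
    hits = DIGIT_SUBSTR.findall(text)  # stage 3: permissive substring scan
    if hits:
        gender = 'F' if F_HINT.search(text) else 'M'
        return f"{gender}{DIGITS[hits[-1].lower()]}"  # last digit wins
    return None


results = {t: detect(t) for t in (
    "schimbă vocea pe M5", "voce em cinci", "schimbă vocea pe Mâcinci",
    "voce unul cinci", "voce pământinci", "M5",
)}
```

In this sketch the substring stage fires on any digit word once a trigger word is present, so an ordinary sentence such as "vocea ta e bună, am trei întrebări" resolves to `M3`. The module as shown appears to behave the same way, and the 29 tests do not cover that stage, so a negative test for such sentences may be worth adding.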


@@ -30,7 +30,10 @@ class TestClearCommand:
response, is_cmd = route_message("ch-1", "user-1", "/clear")
assert response == "Session cleared. Model reset to sonnet."
assert is_cmd is True
-mock_clear.assert_called_once_with("ch-1")
+# /clear drops both the text-adapter session and the isolated voice
+# session for the same Discord channel.
+mock_clear.assert_any_call("ch-1")
+mock_clear.assert_any_call("voice:ch-1")
@patch("src.router._get_config")
@patch("src.router.clear_session")
@@ -191,7 +194,7 @@ class TestRegularMessage:
response, is_cmd = route_message("ch-1", "user-1", "hello")
assert response == "Hello from Claude!"
assert is_cmd is False
-mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None, voice_mode=False)
@patch("src.router.send_message")
def test_model_override(self, mock_send):
@@ -199,7 +202,7 @@ class TestRegularMessage:
response, is_cmd = route_message("ch-1", "user-1", "hello", model="opus")
assert response == "Response"
assert is_cmd is False
-mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)
@patch("src.router._get_channel_config")
@patch("src.router._get_config")
@@ -227,7 +230,7 @@ class TestRegularMessage:
cb = lambda t: None
route_message("ch-1", "user-1", "hello", on_text=cb)
-mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=cb)
+mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=cb, voice_mode=False)
# --- _get_channel_config ---
@@ -269,7 +272,7 @@ class TestModelResolution:
mock_chan_cfg.return_value = {"id": "ch-1", "default_model": "haiku"}
route_message("ch-1", "user-1", "hello")
-mock_send.assert_called_once_with("ch-1", "hello", model="haiku", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="haiku", on_text=None, voice_mode=False)
@patch("src.router._get_channel_config")
@patch("src.router._get_config")
@@ -283,7 +286,7 @@ class TestModelResolution:
mock_get_config.return_value = mock_cfg
route_message("ch-1", "user-1", "hello")
-mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)
@patch("src.router._get_channel_config")
@patch("src.router._get_config")
@@ -297,7 +300,7 @@ class TestModelResolution:
mock_get_config.return_value = mock_cfg
route_message("ch-1", "user-1", "hello")
-mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="sonnet", on_text=None, voice_mode=False)
@patch("src.router.get_active_session")
@patch("src.router.send_message")
@@ -307,4 +310,4 @@ class TestModelResolution:
mock_get_session.return_value = {"model": "opus", "session_id": "abc"}
route_message("ch-1", "user-1", "hello")
-mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None)
+mock_send.assert_called_once_with("ch-1", "hello", model="opus", on_text=None, voice_mode=False)


@@ -0,0 +1,55 @@
"""Tests for src/voice/voice_commands.detect_voice_change."""
from __future__ import annotations
import pytest
from src.voice.voice_commands import detect_voice_change
class TestDetectVoiceChange:
# --- positive cases (direct form) ---
@pytest.mark.parametrize("text,expected", [
("schimbă vocea pe M5", "M5"),
("Schimbă vocea pe F3.", "F3"),
("vorbește cu vocea M1", "M1"),
("vorbește cu vocea F2", "F2"),
("voce M4", "M4"),
("Voce F5.", "F5"),
("treci pe vocea F1", "F1"),
("Echo, treci pe M2.", "M2"),
("voice M3", "M3"),
])
def test_direct_form(self, text, expected):
assert detect_voice_change(text) == expected
# --- positive cases (word form, what Whisper actually produces) ---
@pytest.mark.parametrize("text,expected", [
("schimbă vocea pe em cinci", "M5"),
("vorbește cu vocea em trei", "M3"),
("voce em unu", "M1"),
("schimbă vocea pe ef doi", "F2"),
("voce ef cinci", "F5"),
("vorbește cu vocea masculină cinci", "M5"),
("schimbă vocea pe feminină trei", "F3"),
("voce masculin patru", "M4"),
("schimbă vocea pe M cinci", "M5"),
("voce F două", "F2"),
])
def test_word_form(self, text, expected):
assert detect_voice_change(text) == expected
# --- negative cases ---
@pytest.mark.parametrize("text", [
"",
"cât este ora",
"M5", # no trigger word
"Salut Echo, sunt în M3", # M3 here is a location/etc, no trigger
"vocea ta este foarte bună", # trigger but no voice id
"schimbă te rog", # trigger but no id
"voce M6", # out of range
"voce M0", # out of range
"voce F8", # out of range
"schimbă vocea pe șapte", # digit out of range
])
def test_no_match(self, text):
assert detect_voice_change(text) is None