feat(voice): voice-mode prompt, isolated session, units, verbal voice swap, fast barge-in

Second voice UX iteration. Targets Marius's live-test pain points from today.

- **Voice-mode system prompt** (personality/VOICE_MODE.md, plumbed via
  claude_session.build_system_prompt(voice_mode=True)) — when the voice
  adapter starts a session, append voice-tailored instructions: short replies,
  no markdown, no abbreviations, time without seconds, distances rounded
  to "mii"/"milioane", no curly quotes / em-dash / ellipsis. Marius asked
  for an "in-the-car friend" persona for voice.

- **Isolated voice session key** (router.py) — voice mode uses
  `voice:<channel_id>` so it doesn't share context with the text adapter
  on the same Discord channel. Fresh start, voice prompt applied
  automatically without `/clear` ceremony. `/clear` drops both keys.

- **Metric units + Romanian thousands** (src/voice/normalize.py) —
  `384.000 km` was being read as "trei sute optzeci și patru virgulă zero
  zero zero km" because the dot was treated as decimal separator and `km`
  wasn't expanded. New `normalize_thousands` collapses Romanian thousands
  separators (`X.000`/`X.000.000`) before number expansion, and
  `expand_units` handles km/kg/cm/mm/ml/ha/mp with correct Romanian
  pluralization ("un kilometru", "două kilograme", "douăzeci de
  centimetri", "o sută de kilometri" with "de" particle).

- **`/voice setvoice <M1-F5>` slash command** (discord_voice.py) — Discord
  native autocomplete; swaps the live TTSQueue voice_id AND persists
  voice.default_voice to config.json. No restart needed.

- **Verbal voice change** (src/voice/voice_commands.py — new module +
  29 tests) — say "schimbă vocea pe M5" / "vorbește cu vocea F3" / "voce
  em cinci" from inside the voice channel. Detector requires both a
  trigger word (voce/vorbește/schimbă/treci pe) and a recognizable voice
  ID (direct "M5", word form "em cinci", or fallback substring match for
  Whisper-mangled forms like "unul cinci"=M5 and "Mâcinci"=M5). On
  detection: live-swap, persist to config, mirror to chat with
  `🎤 ... / 🔊 Voce → M5`, speak short ack in the NEW voice, skip
  Claude. "pământinci" still can't be recovered (no recoverable digit
  substring); user gets passthrough to Claude in that case.

- **Whisper initial_prompt** now lists the voice-command vocabulary so
  STT biases toward producing clean "M5" / "F3" tokens instead of
  inventing "pământ" / "unul" phonetic neighbors.

- **Fast barge-in** (pipeline.py EchoVoiceSink) — previously `ttsq.clear()`
  only fired in `on_segment_done` (after 800ms silence + 2-3s STT ≈ 3s lag).
  Now also fires from the sink as soon as VAD detects ≥2 consecutive
  windows (~200ms) of sustained speech from Marius's user while Echo has
  pending TTS frames. Single-window glitches don't cut Echo off; sustained
  speech does. (Acoustic echo bleed-through still requires headphones —
  no AEC in the bot.)

- Tests: 130 voice + router tests pass; updated test_router.py to expect
  `/clear` to drop both text and voice session keys.
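
The `384.000 km` fix described in the units bullet above can be illustrated with a minimal sketch. It is not the contents of src/voice/normalize.py: the regexes, the unit table, and the `_needs_de` helper are assumptions made for illustration, and the sketch leaves digits in place for a later number-expansion step rather than spelling them out.

```python
import re

# Romanian groups thousands with dots: 1-3 digits, then one or more ".ddd" groups.
# Assumption: a dot followed by exactly three digits is never a decimal here,
# since Romanian text normally uses a comma for decimals.
_THOUSANDS = re.compile(r"(?<![\d.,])\d{1,3}(?:\.\d{3})+(?!\d)")

# Hypothetical unit table: abbreviation -> (singular, plural).
_UNITS = {
    "km": ("kilometru", "kilometri"),
    "kg": ("kilogram", "kilograme"),
    "cm": ("centimetru", "centimetri"),
    "mm": ("milimetru", "milimetri"),
    "ml": ("mililitru", "mililitri"),
    "ha": ("hectar", "hectare"),
    "mp": ("metru pătrat", "metri pătrați"),
}
# Integers only; numbers with a decimal comma are left untouched.
_UNIT_RE = re.compile(r"(?<![\d.,])(\d+)\s*(km|kg|cm|mm|ml|ha|mp)\b")


def normalize_thousands(text: str) -> str:
    """Collapse "384.000" / "1.500.000" into plain digit runs."""
    return _THOUSANDS.sub(lambda m: m.group(0).replace(".", ""), text)


def _needs_de(n: int) -> bool:
    """Romanian inserts "de" when the last two digits are 00 or 20-99 (n >= 20)."""
    return n >= 20 and (n % 100 == 0 or n % 100 >= 20)


def expand_units(text: str) -> str:
    """Replace unit abbreviations with the singular or plural Romanian word."""
    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        singular, plural = _UNITS[m.group(2)]
        if n == 1:
            return f"1 {singular}"
        particle = "de " if _needs_de(n) else ""
        return f"{m.group(1)} {particle}{plural}"

    return _UNIT_RE.sub(repl, text)


print(expand_units(normalize_thousands("Luna e la 384.000 km.")))
```

Running thousands normalization first matters: once the dots are gone, the number reaches the expander as a single integer, so the dot can no longer be read as "virgulă".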

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 20:59:10 +00:00
parent d1bc77e87d
commit e589e4885e
10 changed files with 412 additions and 20 deletions


@@ -0,0 +1,55 @@
"""Tests for src/voice/voice_commands.detect_voice_change."""
from __future__ import annotations

import pytest

from src.voice.voice_commands import detect_voice_change


class TestDetectVoiceChange:
    # --- positive cases (direct form) ---
    @pytest.mark.parametrize("text,expected", [
        ("schimbă vocea pe M5", "M5"),
        ("Schimbă vocea pe F3.", "F3"),
        ("vorbește cu vocea M1", "M1"),
        ("vorbește cu vocea F2", "F2"),
        ("voce M4", "M4"),
        ("Voce F5.", "F5"),
        ("treci pe vocea F1", "F1"),
        ("Echo, treci pe M2.", "M2"),
        ("voice M3", "M3"),
    ])
    def test_direct_form(self, text, expected):
        assert detect_voice_change(text) == expected

    # --- positive cases (word form, what Whisper actually produces) ---
    @pytest.mark.parametrize("text,expected", [
        ("schimbă vocea pe em cinci", "M5"),
        ("vorbește cu vocea em trei", "M3"),
        ("voce em unu", "M1"),
        ("schimbă vocea pe ef doi", "F2"),
        ("voce ef cinci", "F5"),
        ("vorbește cu vocea masculină cinci", "M5"),
        ("schimbă vocea pe feminină trei", "F3"),
        ("voce masculin patru", "M4"),
        ("schimbă vocea pe M cinci", "M5"),
        ("voce F două", "F2"),
    ])
    def test_word_form(self, text, expected):
        assert detect_voice_change(text) == expected

    # --- negative cases ---
    @pytest.mark.parametrize("text", [
        "",
        "cât este ora",
        "M5",                          # no trigger word
        "Salut Echo, sunt în M3",      # M3 here is a location/etc, no trigger
        "vocea ta este foarte bună",   # trigger but no voice id
        "schimbă te rog",              # trigger but no id
        "voce M6",                     # out of range
        "voce M0",                     # out of range
        "voce F8",                     # out of range
        "schimbă vocea pe șapte",      # digit out of range
    ])
    def test_no_match(self, text):
        assert detect_voice_change(text) is None
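
The cases above pin down the contract: a match needs both a trigger word and a voice ID in the M1-M5 / F1-F5 range. Below is a minimal sketch of a detector that would satisfy them. It is not the code in src/voice/voice_commands.py: the name `detect_voice_change_sketch`, the regexes, and the diacritic folding are assumptions, and the fallback substring match for Whisper-mangled forms ("unul cinci", "Mâcinci") is deliberately left out.

```python
from __future__ import annotations

import re
import unicodedata

# Trigger words, matched on lowercased, diacritic-free text.
# "voice" is included because the tests expect "voice M3" to match.
_TRIGGER = re.compile(r"\b(?:voce\w*|voice|vorbe\w*|schimb\w*|treci\s+pe)\b")

# Gender token ("m", "em", "masculin...", "f", "ef", "feminin...") followed by
# a digit 1-5 or its Romanian word form.
_VOICE_ID = re.compile(
    r"\b(em|ef|masculin\w*|feminin\w*|m|f)\s*"
    r"(unu|una|doi|doua|trei|patru|cinci|[1-5])\b"
)
_DIGITS = {"unu": 1, "una": 1, "doi": 2, "doua": 2, "trei": 3, "patru": 4, "cinci": 5}


def _fold(text: str) -> str:
    """Lowercase and strip combining marks, so "două" becomes "doua"."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def detect_voice_change_sketch(text: str) -> str | None:
    """Return a voice ID such as "M5", or None when this is not a voice command."""
    folded = _fold(text)
    if not _TRIGGER.search(folded):
        return None  # an ID without a trigger word is ignored
    match = _VOICE_ID.search(folded)
    if match is None:
        return None  # trigger present, but no in-range voice ID
    gender_token, digit_token = match.groups()
    gender = "M" if gender_token in ("m", "em") or gender_token.startswith("masculin") else "F"
    digit = int(digit_token) if digit_token.isdigit() else _DIGITS[digit_token]
    return f"{gender}{digit}"


print(detect_voice_change_sketch("Schimbă vocea pe feminină trei"))
```

Folding diacritics before matching keeps the patterns ASCII-only and tolerates both comma-below and cedilla spellings of "ș" in the transcript.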