echo-core/tests/test_voice_commands.py
Marius Mutu 4be70440e8 feat(voice): DAVE E2E + full voice UX (squash of voice/dave-recv)
Squashed branch: voice/dave-recv → master. Closes Pas 12 (DAVE E2E) and lands
voice-mode UX polish + verbal voice control on top of the Pas 1-10 scaffolding
already on master.

## DAVE E2E receive-side decrypt (e4f3177)

Vendored fork: discord-ext-voice-recv 0.5.3a+echo.dave1. Patches the receive
pipeline to handle Discord's mandatory DAVE encryption on voice gateway v=8.
- `_maybe_dave_decrypt`: uses davey.can_passthrough(user_id) as primary gate,
  falls through to dave.decrypt for DAVE-epoch peers, drops on decrypt failure
  without killing the reader thread.
- VAD fix: silero-vad v5+ requires exactly 512 samples per call at 16 kHz; our
  100ms window (1600 samples) raised a ValueError that was swallowed, so STT
  never fired. The window is now sliced into 512-sample chunks.
- Whisper: bumped beam_size 1→5 and added a Romanian initial_prompt.
- Tests: 11 DAVE unit tests + 2 callback integration tests + contract test
  with fork-version guard.
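
A minimal sketch of the 512-sample slicing described in the VAD fix above. The
`VadChunker` class and its `feed` method are hypothetical names, and carrying
the 64-sample remainder into the next window is an assumption; the commit only
states that the window is sliced into 512-sample chunks.

```python
from __future__ import annotations

VAD_CHUNK = 512  # samples per call that silero-vad v5+ accepts at 16 kHz


class VadChunker:
    """Hypothetical helper: re-slices sample windows into fixed-size chunks."""

    def __init__(self, chunk_size: int = VAD_CHUNK) -> None:
        self._chunk_size = chunk_size
        # Assumption: leftover samples are kept and prepended to the next window.
        self._rest: list[float] = []

    def feed(self, window: list[float]) -> list[list[float]]:
        """Return every complete chunk available after adding `window`."""
        buf = self._rest + list(window)
        cut = (len(buf) // self._chunk_size) * self._chunk_size
        self._rest = buf[cut:]
        return [buf[i:i + self._chunk_size] for i in range(0, cut, self._chunk_size)]


chunker = VadChunker()
chunks = chunker.feed([0.0] * 1600)  # one 100ms window: 1600 = 3 * 512 + 64
```

Each returned chunk would then be passed to the VAD model on its own, so the
model never sees a 1600-sample input.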

## Voice UX polish (d1bc77e)

- Removed the 3s "mă gândesc" ("I'm thinking") filler; it always collided with
  Claude's reply (p50 latency 4-7s).
- Barge-in via `ttsq.clear()` at top of `on_segment_done`.
- DTX silence-flush poller (200ms tick) — Discord stops sending RTP packets
  when silent, so the inline silence-check in sink.write() never fired for
  trailing audio; background thread handles it.
- `EchoStreamingAudioSource.read()` is now non-blocking: the old
  `get_frame(timeout=0.1)` could block for up to 100ms, which wrecked Discord's
  20ms cadence, and the client interpreted the resulting bursts as stuttering
  (Marius heard "4 de minute" instead of the full sentence).
- Romanian time expansion: 23:09 → "douăzeci și trei și nouă minute".
- Supertonic Unicode sanitize centralized in tools/tts.py.
- Whisper local_files_only=True — no HF metadata GET on each startup.
- Diagnostic logging through sink → VAD → Claude stream → TTS chain.
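
A rough sketch of the non-blocking `read()` idea from the list above. The
class name `NonBlockingFrameSource`, the `push`/`close` methods, and the choice
to return a silence frame when the queue is empty are all assumptions, not the
actual `EchoStreamingAudioSource` code. The frame size follows discord.py's
documented contract for PCM sources (20ms of 48 kHz, 16-bit stereo is 3840
bytes, and an empty bytes object signals end of stream).

```python
import queue

FRAME_BYTES = 3840  # 20ms of 48 kHz, 16-bit, stereo PCM
SILENCE = b"\x00" * FRAME_BYTES


class NonBlockingFrameSource:
    """Hypothetical audio source whose read() never waits on the producer."""

    def __init__(self) -> None:
        self._frames: "queue.Queue[bytes]" = queue.Queue()
        self._closed = False

    def push(self, frame: bytes) -> None:
        """Called by the TTS producer with one 20ms PCM frame."""
        self._frames.put(frame)

    def close(self) -> None:
        """Mark the stream finished; queued frames are still played out."""
        self._closed = True

    def read(self) -> bytes:
        """Called by the player every 20ms; returns immediately."""
        try:
            return self._frames.get_nowait()
        except queue.Empty:
            # Assumption: pad with silence while TTS is still producing,
            # and return b"" only once the stream has been closed.
            return b"" if self._closed else SILENCE
```

Because `read()` returns immediately, the player keeps its 20ms pacing even
when the TTS producer falls behind.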

## Voice mode iteration (e589e48)

- `personality/VOICE_MODE.md` — voice-tailored system prompt (short, no
  markdown, no abbreviations, time without seconds, distances in
  "mii"/"milioane"); plumbed via build_system_prompt(voice_mode=True).
- Isolated voice session key `voice:<channel_id>` — voice doesn't share
  context with text adapter on the same channel; auto-applied without
  /clear ceremony. /clear drops both keys.
- Metric units + Romanian thousands (normalize.py): "384.000 km" →
  "trei sute optzeci și patru de mii de kilometri" with feminine-correct
  pluralization and "de" particle for ≥20.
- `/voice setvoice <M1-F5>` slash command with native autocomplete; swaps
  live + persists voice.default_voice to config.json.
- Verbal voice change (src/voice/voice_commands.py + 29 tests) — "schimbă
  vocea pe M5", "voce em cinci", with permissive substring fallback for
  Whisper-mangled forms like "Mâcinci"=M5 and "unul cinci"=M5. Whisper
  initial_prompt now lists voice vocabulary to bias STT toward clean
  outputs.
- Fast barge-in: if VAD fires on ≥2 consecutive windows (~200ms) for Marius's
  user while Echo still has pending TTS frames, Echo's playback is cut off
  mid-sentence so the user doesn't wait for the full silence + STT cycle.
  Acoustic echo bleed-through still requires headphones (no AEC).
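
To illustrate the verbal voice change described above, here is a simplified
sketch of a trigger-plus-id matcher. It is not the code in
src/voice/voice_commands.py: the function name `detect_voice_change_sketch`,
the trigger list, and the vocabulary tables are assumptions, and the permissive
substring fallback for Whisper-mangled forms is left out.

```python
from __future__ import annotations

import re
import unicodedata

# Hypothetical vocabulary; the real module may use different tables.
_GENDER = {"m": "M", "em": "M", "masculin": "M", "masculina": "M",
           "f": "F", "ef": "F", "feminin": "F", "feminina": "F"}
_NUMBER = {"1": 1, "2": 2, "3": 3, "4": 4, "5": 5,
           "unu": 1, "una": 1, "doi": 2, "doua": 2,
           "trei": 3, "patru": 4, "cinci": 5}

_TRIGGER_RE = re.compile(r"\b(?:vocea|voce|voice|treci\s+pe)\b")
_ID_RE = re.compile(
    r"\b(?P<g>masculina|masculin|feminina|feminin|em|ef|m|f)"
    r"(?:\s*(?P<d>[1-5])|\s+(?P<w>unu|una|doi|doua|trei|patru|cinci))\b"
)


def _normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse punctuation to spaces."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", stripped).strip()


def detect_voice_change_sketch(text: str) -> str | None:
    """Return a voice id such as "M5", or None when no command is found."""
    norm = _normalize(text)
    if not _TRIGGER_RE.search(norm):
        return None  # a bare "M5" without a trigger word is ignored
    match = _ID_RE.search(norm)
    if match is None:
        return None  # trigger present, but no id in the M1-M5 / F1-F5 range
    number = _NUMBER[match.group("d") or match.group("w")]
    return f"{_GENDER[match.group('g')]}{number}"
```

Requiring both a trigger word and an in-range id keeps ordinary sentences that
merely mention "M3" or "vocea" from switching the voice.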

## Test suite

130 voice + router tests pass (test_voice_recv_dave, test_voice_session_cleanup,
test_voice_adapter_contract, test_voice_normalize, test_voice_commands,
test_router).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:00:27 +00:00

"""Tests for src/voice/voice_commands.detect_voice_change."""
from __future__ import annotations

import pytest

from src.voice.voice_commands import detect_voice_change


class TestDetectVoiceChange:
    # --- positive cases (direct form) ---
    @pytest.mark.parametrize("text,expected", [
        ("schimbă vocea pe M5", "M5"),
        ("Schimbă vocea pe F3.", "F3"),
        ("vorbește cu vocea M1", "M1"),
        ("vorbește cu vocea F2", "F2"),
        ("voce M4", "M4"),
        ("Voce F5.", "F5"),
        ("treci pe vocea F1", "F1"),
        ("Echo, treci pe M2.", "M2"),
        ("voice M3", "M3"),
    ])
    def test_direct_form(self, text, expected):
        assert detect_voice_change(text) == expected

    # --- positive cases (word form, what Whisper actually produces) ---
    @pytest.mark.parametrize("text,expected", [
        ("schimbă vocea pe em cinci", "M5"),
        ("vorbește cu vocea em trei", "M3"),
        ("voce em unu", "M1"),
        ("schimbă vocea pe ef doi", "F2"),
        ("voce ef cinci", "F5"),
        ("vorbește cu vocea masculină cinci", "M5"),
        ("schimbă vocea pe feminină trei", "F3"),
        ("voce masculin patru", "M4"),
        ("schimbă vocea pe M cinci", "M5"),
        ("voce F două", "F2"),
    ])
    def test_word_form(self, text, expected):
        assert detect_voice_change(text) == expected

    # --- negative cases ---
    @pytest.mark.parametrize("text", [
        "",
        "cât este ora",
        "M5",  # no trigger word
        "Salut Echo, sunt în M3",  # M3 here is a location/etc, no trigger
        "vocea ta este foarte bună",  # trigger but no voice id
        "schimbă te rog",  # trigger but no id
        "voce M6",  # out of range
        "voce M0",  # out of range
        "voce F8",  # out of range
        "schimbă vocea pe șapte",  # digit out of range
    ])
    def test_no_match(self, text):
        assert detect_voice_change(text) is None