echo-core

Author	SHA1	Message	Date
Marius Mutu	4be70440e8	feat(voice): DAVE E2E + full voice UX (squash of voice/dave-recv)

Squashed branch: voice/dave-recv → master. Closes Pas 12 (DAVE E2E) and lands voice-mode UX polish + verbal voice control on top of the Pas 1-10 scaffolding already on master.

## DAVE E2E receive-side decrypt (`e4f3177`)
Vendored fork: discord-ext-voice-recv 0.5.3a+echo.dave1. Patches the receive pipeline to handle Discord's mandatory DAVE encryption on voice gateway v=8.
- `_maybe_dave_decrypt`: uses davey.can_passthrough(user_id) as the primary gate, falls through to dave.decrypt for DAVE-epoch peers, and drops the packet on decrypt failure without killing the reader thread.
- VAD fix: silero-vad v5+ requires exactly 512 samples per call at 16 kHz; our 100ms window (1600 samples) was silently raising ValueError, so STT never fired. Windows are now sliced into 512-sample chunks.
- Whisper: bumped beam_size 1→5 and added a Romanian initial_prompt.
- Tests: 11 DAVE unit tests + 2 callback integration tests + a contract test with a fork-version guard.

## Voice UX polish (`d1bc77e`)
- Killed the 3s "mă gândesc" ("I'm thinking") filler; it always collided with Claude's 4-7s p50 latency.
- Barge-in via `ttsq.clear()` at the top of `on_segment_done`.
- DTX silence-flush poller (200ms tick): Discord stops sending RTP packets during silence, so the inline silence check in sink.write() never fired for trailing audio; a background thread now handles it.
- `EchoStreamingAudioSource.read()` is now non-blocking: the old `get_frame(timeout=0.1)` wrecked Discord's 20ms cadence, and the client interpreted the resulting bursts as stuttering (Marius heard "4 de minute" instead of the full sentence).
- RO time expansion: 23:09 → "douăzeci și trei și nouă minute".
- Supertonic Unicode sanitization centralized in tools/tts.py.
- Whisper local_files_only=True: no HF metadata GET on each startup.
- Diagnostic logging along the sink → VAD → Claude stream → TTS chain.

## Voice mode iteration (`e589e48`)
- `personality/VOICE_MODE.md`: voice-tailored system prompt (short, no markdown, no abbreviations, time without seconds, distances in "mii"/"milioane", i.e. thousands/millions); plumbed via build_system_prompt(voice_mode=True).
- Isolated voice session key `voice:<channel_id>`: voice doesn't share context with the text adapter on the same channel; applied automatically, with no /clear ceremony. /clear drops both keys.
- Metric units + Romanian thousands (normalize.py): "384.000 km" → "trei sute optzeci și patru de mii de kilometri", with feminine-correct pluralization and the "de" particle for ≥20.
- `/voice setvoice <M1-F5>` slash command with native autocomplete; swaps the voice live and persists voice.default_voice to config.json.
- Verbal voice change (src/voice/voice_commands.py + 29 tests): "schimbă vocea pe M5" ("change the voice to M5"), "voce em cinci" ("voice em five"), with a permissive substring fallback for Whisper-mangled forms like "Mâcinci"=M5 and "unul cinci"=M5. The Whisper initial_prompt now lists the voice vocabulary to bias STT toward clean outputs.
- Fast barge-in: ≥2 consecutive VAD windows (~200ms) of speech from Marius's user while Echo has pending TTS frames → cut Echo off mid-sentence, so the user doesn't wait for the full silence + STT cycle. Acoustic echo bleed-through still requires headphones (no AEC).

## Test suite
130 voice + router tests pass (test_voice_recv_dave, test_voice_session_cleanup, test_voice_adapter_contract, test_voice_normalize, test_voice_commands, test_router).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:00:27 +00:00
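A minimal sketch of the 512-sample slicing behind the VAD fix, assuming 16 kHz mono samples held in a plain Python list. `chunk_for_vad` and its carry-over of the remainder are hypothetical; the actual sink may drop or pad the tail instead. Because 1600 = 3 × 512 + 64, a carry buffer keeps the leftover 64 samples for the next 100ms window rather than losing them.

```python
from typing import List, Tuple

# Silero VAD v5+ expects exactly 512 samples per call at 16 kHz.
VAD_CHUNK_SAMPLES = 512


def chunk_for_vad(
    carry: List[int], window: List[int], chunk: int = VAD_CHUNK_SAMPLES
) -> Tuple[List[List[int]], List[int]]:
    """Split carry + window into fixed-size chunks for the VAD model.

    Returns (full_chunks, new_carry); new_carry holds the tail shorter than
    `chunk` so it is prepended to the next window instead of being dropped.
    """
    buf = carry + window
    n_full = len(buf) // chunk
    chunks = [buf[i * chunk:(i + 1) * chunk] for i in range(n_full)]
    return chunks, buf[n_full * chunk:]
```

Fed eight consecutive 1600-sample windows (800ms, the same span as SILENCE_FLUSH_MS), this emits 25 full chunks with nothing left over, since 8 × 1600 = 25 × 512.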
Marius Mutu	23666f7910	feat(voice): Pas 5 — voice/pipeline.py VoiceSession + EchoVoiceSink + cleanup

Central voice pipeline (~250 LOC + docstrings = ~430 lines):

VoiceSession (context manager + idempotent cleanup across 5 paths):
- __enter__: acquire _lock, open JSONL (record=on)
- __exit__: calls cleanup("exit"); does not suppress exceptions
- cleanup(reason): IDEMPOTENT, side effects run exactly once: JSONL flush+close (record=on) or delete (record=off), bot presence cleared, voice_client.cleanup(), ttsq.stop(), filler task cancelled, lock released, structured log to logs/voice_metrics.jsonl
- on_segment_done(speaker_id, text, no_speech_prob): mirror the text to the text channel, append to JSONL, arm the 3s filler timer, call route_message with an on_text callback, and cancel the filler on the first block
- last_activity_ts: time.monotonic(); the caller drives the 5min auto-leave

EchoVoiceSink(session, bot_user_id):
- wants_opus() returns False (PCM)
- write() runs in the voice_recv reader thread (threading primitives only):
  - GUARD 1: user None / id==0 / id==bot_user_id → return (load-bearing echo prevention)
  - GUARD 2: whitelist filter (empty = allow all)
  - Buffer 20ms packets per user → batch 100ms (5×20ms = 19200 bytes) → silero-vad threshold 0.5 → flush after 800ms of cumulative silence
  - _flush_to_stt: faster-whisper small int8 cpu_threads=4 lang=ro beam_size=1, drop if no_speech_prob > 0.6, schedule on_segment_done via run_coroutine_threadsafe on session.loop

Module helpers (lazy thread-safe singletons): _get_whisper_model, _get_silero_vad.

Constants: FILLER_DELAY_S=3.0, SILENCE_FLUSH_MS=800, VAD_THRESHOLD=0.5, VAD_WINDOW_MS=100, NO_SPEECH_DROP_THRESHOLD=0.6.

Decisions:
- STT runs in the audio thread: acceptable at 2.25s p50 (the user just stopped talking, so there is no batching contention). Wrap in ThreadPoolExecutor.submit if performance bites later.
- Downsample 48k→16k via 3-sample averaging (no scipy dependency). Whisper is robust to mild aliasing.
- Energy-RMS VAD fallback if the torch import fails: graceful degradation.
- router_route_message injection seam as a kwarg for testability.
- bot.change_presence handled cross-thread via run_coroutine_threadsafe.

tests/test_voice_session_cleanup.py: 6 tests:
- voice_leave / disconnect / crash via __exit__ / auto_leave / user_left_channel (each of the 5 cleanup paths verified for: JSONL state, presence cleared, voice_client.cleanup, ttsq.stop, lock release, idempotency)
- 1 robustness cross-cut (double-cleanup safety)

6/6 PASS. Regression suite 63/63 PASS (normalize + adapter + mutex).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 14:55:57 +00:00
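The "3-sample averaging" downsampling decision can be sketched with the standard library alone. This assumes 48 kHz s16le stereo input, matching the 100ms batches above (5 packets of 3840 bytes = 19200 bytes), and that left/right are averaged to mono first; `downsample_48k_stereo_to_16k_mono` is a hypothetical name, and the real code may use numpy instead. Averaging each group of three acts as a crude low-pass filter before decimation, which is why some aliasing remains.

```python
import array
import sys
from typing import List


def downsample_48k_stereo_to_16k_mono(pcm: bytes) -> List[int]:
    """48 kHz s16le stereo -> 16 kHz mono via channel + 3-sample averaging."""
    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 4])  # whole stereo frames only
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian (s16le)
    # Stereo -> mono: average the left and right sample of each frame.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    # 48k -> 16k: average each group of 3 consecutive mono samples.
    usable = len(mono) - len(mono) % 3
    return [(mono[i] + mono[i + 1] + mono[i + 2]) // 3 for i in range(0, usable, 3)]
```

One 19200-byte batch comes out as 1600 samples, the same 100ms window size the VAD chunking works on.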
Marius Mutu	217da65417	feat(voice): Pas 6 — voice/tts_stream.py streaming TTS pipeline

src/voice/tts_stream.py (~280 lines):
- clause_segments(text, min_words=8): yields Romanian-aware clause chunks. Splits at punctuation (. ! ? ; : ,), accumulating until min_words is satisfied; edge case: text shorter than min_words → a single chunk. Does NOT split mid-word, mid-number or mid-currency. Romanian phrase intonation breaks at each sentence boundary, so 8+ words per chunk minimizes audible seams.
- TTSQueue worker thread: text queue in → PCM frames out. Methods: start/stop/push_text/push_filler/clear/is_empty. Applies normalize_for_tts() first, then clause_segments(), then Supertonic synthesis per chunk.
- EchoStreamingAudioSource(discord.AudioSource): read() pulls from the PCM queue in 20ms frames (3840 bytes, 48kHz s16le stereo). Eliminates the RTP gap between play() calls: a single play() per session, and the source pulls.
- load_thinking_wav(): one-shot cache → 140 × 20ms frames (~2.8s) for the filler "Stai puțin să-mi adun gândurile" ("Hold on, let me gather my thoughts").
- wav_to_pcm_20ms_frames(): WAV parser + ffmpeg subprocess resample to 48kHz s16le stereo if needed.

Smoke test (in session): clause_segments behaviour OK, thinking.wav loads, TTSQueue + EchoStreamingAudioSource construct cleanly. Integration testing deferred until convergence (Pas 7 + Pas 11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 14:44:13 +00:00
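A simplified take on the clause_segments idea, assuming that splitting only at punctuation followed by whitespace (or end of text) is enough to keep numbers like "3.5" or "12,50" intact. This is a hypothetical re-implementation for illustration, not the code in src/voice/tts_stream.py; it relies on abbreviations having already been expanded by normalize_for_tts.

```python
import re
from typing import Iterator, List

# A clause ends at . ! ? ; : , only when followed by whitespace or end of text,
# so decimals such as "3.5" or "12,50" are never split.
_CLAUSE_END = re.compile(r"(?<=[.!?;:,])(?=\s|$)")


def clause_segments(text: str, min_words: int = 8) -> Iterator[str]:
    """Yield chunks of at least `min_words` words, cut at clause punctuation."""
    pending: List[str] = []
    for piece in _CLAUSE_END.split(text):  # zero-width split needs Python 3.7+
        words = piece.split()
        if not words:
            continue
        pending.extend(words)
        if len(pending) >= min_words:
            yield " ".join(pending)
            pending = []
    if pending:  # trailing remainder, or the whole text if it is short
        yield " ".join(pending)
```

In this sketch a trailing remainder shorter than min_words still comes out as its own chunk; merging it into the previous chunk would be an alternative trade-off.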
Marius Mutu	a3eefbc799	feat(voice): Pas 4 — _discord_voice_adapter.py thin layer + contract test

Adapter layer over the vendored discord-ext-voice-recv. Re-exports: VoiceReceiveClient, AudioSink, VoiceData, plus an async helper connect_voice(channel). The Discord voice protocol is fragile and upstream is a hobby fork; if it breaks, swapping to py-cord means rewriting only this file.

Contract test (22 assertions) catches drift on vendor upgrades:
- VoiceReceiveClient methods: connect/disconnect/listen/stop_listening/is_listening/stop/cleanup
- listen(sink, *, after=None) signature
- sink property (read/write)
- AudioSink methods: write/cleanup/wants_opus + write(self, user, data) arity
- VoiceData slots (packet/source/pcm) + .opus property

Critical for Lane PIPE downstream: write() is called from the audio thread (NOT the asyncio loop), so threading primitives are mandatory for EchoVoiceSink.

22/22 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 14:42:50 +00:00
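A contract check in the spirit of the one above can be written with the standard inspect module. The sketch assumes nothing about the vendored package: `FakeVoiceReceiveClient` is a hypothetical stand-in so the example runs without discord installed, and `check_listen_contract` is likewise hypothetical; a real test would pass it the adapter's re-exported VoiceReceiveClient instead.

```python
import inspect
from typing import List

REQUIRED_METHODS = (
    "connect", "disconnect", "listen", "stop_listening",
    "is_listening", "stop", "cleanup",
)


def check_listen_contract(cls: type) -> List[str]:
    """Return contract violations for a VoiceReceiveClient-like class ([] = OK)."""
    problems: List[str] = []
    for name in REQUIRED_METHODS:
        if not callable(getattr(cls, name, None)):
            problems.append(f"missing method: {name}")
    listen = getattr(cls, "listen", None)
    if callable(listen):
        params = inspect.signature(listen).parameters
        # Expected shape: listen(self, sink, *, after=None)
        if list(params)[:2] != ["self", "sink"]:
            problems.append("listen() must take (self, sink, ...)")
        after = params.get("after")
        if after is None or after.kind != inspect.Parameter.KEYWORD_ONLY:
            problems.append("listen() must have keyword-only 'after'")
        elif after.default is not None:
            problems.append("listen(after=...) must default to None")
    sink = getattr(cls, "sink", None)
    if not (isinstance(sink, property) and sink.fget and sink.fset):
        problems.append("sink must be a read/write property")
    return problems


class FakeVoiceReceiveClient:
    """Hypothetical stand-in that satisfies the contract."""

    def __init__(self) -> None:
        self._sink = None

    def connect(self): ...
    def disconnect(self): ...
    def listen(self, sink, *, after=None): ...
    def stop_listening(self): ...
    def is_listening(self): ...
    def stop(self): ...
    def cleanup(self): ...

    @property
    def sink(self):
        return self._sink

    @sink.setter
    def sink(self, value):
        self._sink = value
```

Returning a list of violations rather than asserting on the first one makes a vendor upgrade report every drifted item at once.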
Marius Mutu	a48562b2f5	feat(voice): Pas 3 — voice/normalize.py + 35 RO test cases

Pure functions for TTS text normalization (RO):
- strip_markdown: regexes for bold/italic/code/link/heading/list
- expand_numbers_ro: num2words for cardinals + decimal handling ("3.14" → "trei virgulă paisprezece", "3.05" → "trei virgulă zero cinci", digit by digit when there is a leading zero)
- expand_currency: natural Romanian form ("12.50 RON" → "doisprezece lei și cincizeci de bani", "$25.99" → "douăzeci și cinci de dolari și nouăzeci și nouă de cenți")
- expand_symbols: %/&/@/° + whitespace collapse
- expand_abbreviations: etc./dl./dna./nr./ş.a./ş.a.m.d.
- normalize_for_tts: full pipeline + hard truncation at 200 words with "Restul l-am scris în chat." ("I wrote the rest in the chat.")

Pipeline order: markdown → abbreviations → currency → numbers → symbols → truncate. Currency runs BEFORE numbers; otherwise "12.50 RON" degrades to "doisprezece virgulă cincizeci RON".

Romanian "de" particle rule: n>=20 AND (n%100 not in 1..19) → "o sută de lei", but "o sută cinci lei" (no "de"). n=1 with currency → "un dolar" / "un leu" (article, not cardinal).

35/35 tests pass: markdown(5), cardinals(6), decimals(4), currency RON/USD/EUR/GBP mix(8), symbols(4), abbreviations(4), truncation(2), edge cases empty/whitespace(2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 14:42:41 +00:00
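The "de" particle rule above reduces to a small predicate. A minimal sketch, assuming the number has already been spelled out by another step (for example num2words); `needs_de` and `with_unit` are hypothetical helpers, and the n=1 article case ("un leu", "un dolar") is deliberately left out.

```python
def needs_de(n: int) -> bool:
    """Romanian puts "de" between a number and its noun when n >= 20 and the
    last two digits are not 01..19 ("o sută de lei" vs "o sută cinci lei")."""
    return n >= 20 and not (1 <= n % 100 <= 19)


def with_unit(n: int, spelled: str, noun_plural: str) -> str:
    """Join an already-spelled number with its plural noun, adding "de" if needed."""
    if needs_de(n):
        return f"{spelled} de {noun_plural}"
    return f"{spelled} {noun_plural}"
```

For example, with_unit(25, "douăzeci și cinci", "dolari") gives "douăzeci și cinci de dolari", while with_unit(105, "o sută cinci", "lei") gives "o sută cinci lei".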
Marius Mutu	af5af8133f	feat(voice): Pas 2 — install voice deps, vendor discord-ext-voice-recv, setup assets

Foundation for the Discord voice-to-voice pipeline.
- requirements.txt: faster-whisper, silero-vad, num2words, numpy, PyNaCl
- vendor/discord-ext-voice-recv/: vendored at commit ac04ea7b09 (version bumped to 0.5.3a). The Discord voice protocol is fragile and upstream is a hobby fork. The adapter layer in src/voice/_discord_voice_adapter.py isolates the churn (swapping to py-cord means rewriting only that file). VENDOR_INFO.md documents the update procedure.
- tools/voice_setup.py: idempotent setup script covering the libopus check, ffmpeg check, Supertonic reachability, faster-whisper/silero-vad warm-up and asset generation. Exit 0 = green, 1 = needs a human (currently libopus is missing and needs `sudo apt install -y libopus0`).
- assets/voice/: thinking.wav (filler "Stai puțin să-mi adun gândurile", ~2.8s), mhm.wav (listener noise), beep_200ms.wav (880Hz wake-up tone).
- src/voice/__init__.py: package stub.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 14:42:27 +00:00
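The exit-code contract of tools/voice_setup.py can be sketched with stdlib probes. This is an illustration under assumptions, not the script itself: it covers only the ffmpeg and libopus checks (via shutil.which and ctypes.util.find_library), and `run_checks` / `default_checks` are hypothetical names. Each check is a read-only, zero-argument callable, so running it repeatedly is harmless and it is easy to test with fakes.

```python
import ctypes.util
import shutil
import sys
from typing import Callable, Dict


def default_checks() -> Dict[str, Callable[[], bool]]:
    """Hypothetical subset of the setup probes."""
    return {
        "ffmpeg": lambda: shutil.which("ffmpeg") is not None,
        # find_library returns None when the loader cannot locate libopus.
        "libopus": lambda: ctypes.util.find_library("opus") is not None,
    }


def run_checks(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every check; return 0 when all pass (green), 1 when a human is needed."""
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"MISSING: {name}", file=sys.stderr)
    return 1 if failed else 0
```

A script entry point would then be `sys.exit(run_checks(default_checks()))`.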

6 Commits