Commit Graph

8 Commits

Author SHA1 Message Date
0ce8a5a04d Update cron, dashboard, root +3 more (+1 ~11) 2026-05-28 20:21:28 +00:00
e79bed7afe feat(voice): unify Discord voice↔text session (squash of voice/text-unify)
Voice utterances and text messages on the same Discord channel now share
one Claude session, and Echo's voice replies are mirrored back into the
text channel. Replaces the old voice:<id> session-key split.

Changes:
- src/adapters/_text_chunks.py: new leaf module for split_message
  (used by both discord_bot and voice pipeline)
- src/router.py: drop voice: prefix from session_key; add [voice] marker;
  strip leading [speaker:/[voice] tokens from user input (anti-jailbreak);
  remove dead double-clear of voice: key
- src/claude_session.py: include personality/VOICE_MODE.md unconditionally
  (rules become per-turn-aware via [speaker:] prefix instead of session flag)
- src/voice/pipeline.py: VoiceSession splits text_channel_id +
  voice_channel_id; resolve text channel per-send (no stale refs); mirror
  Echo's reply text into the text channel after route_message returns
- src/adapters/discord_voice.py: /voice join passes both channel ids
- src/adapters/discord_bot.py: import split_message from leaf module
- personality/VOICE_MODE.md: rewrite as per-turn dynamic rules;
  add synthesis instructions for text turns after voice turns
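
The anti-jailbreak stripping in src/router.py could look roughly like the sketch below. The token grammar (`[voice]` and `[speaker:...]` closed by `]`) and the function name are assumptions for illustration, not the router's actual code:

```python
import re

# Hypothetical pattern: one leading "[voice]" or "[speaker:...]" token.
_MARKER_RE = re.compile(r"^\s*\[(?:voice|speaker:[^\]]*)\]\s*", re.IGNORECASE)


def strip_leading_markers(text: str) -> str:
    """Remove any run of leading [voice]/[speaker:...] tokens from user input.

    Only leading tokens are stripped, so a user cannot impersonate the
    markers that the pipeline itself prepends; tokens later in the text
    are left alone.
    """
    while True:
        stripped = _MARKER_RE.sub("", text, count=1)
        if stripped == text:
            return text
        text = stripped
```

Looping until nothing changes handles stacked tokens such as `[voice] [speaker:x] ...` in one call.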

Tests:
- tests/test_router.py: 4 new cases (plain channel_id, anti-jailbreak,
  text-adapter regression, no-double-clear)
- tests/test_pipeline_mirror.py: new — Echo reply mirror chunking,
  empty guard, mirror_enabled=False, send-raises resilience
- tests/test_voice_session_channel_ids.py: new — split-attr contract
  + metrics payload schema
- tests/test_voice_session_cleanup.py: update for new kwargs
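
For context on the mirror chunking and empty guard tested above, here is a minimal sketch of a `split_message`-style helper. The 2000-character cap is Discord's message limit; the break-point preference (newline, then space, then hard cut) is an assumption, not necessarily what src/adapters/_text_chunks.py does:

```python
DISCORD_MESSAGE_LIMIT = 2000  # Discord's per-message character cap


def split_message(text: str, limit: int = DISCORD_MESSAGE_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Prefers to break at a newline, then at a space, and only hard-cuts
    when neither exists. Empty or whitespace-only input yields [].
    """
    chunks = []
    remaining = text
    while len(remaining) > limit:
        cut = remaining.rfind("\n", 0, limit)
        if cut <= 0:
            cut = remaining.rfind(" ", 0, limit)
        if cut <= 0:
            cut = limit  # no natural break point: hard cut
        head = remaining[:cut].rstrip()
        if head:
            chunks.append(head)
        remaining = remaining[cut:].lstrip()
    if remaining.strip():
        chunks.append(remaining)
    return chunks
```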

Plan: /home/moltbot/.claude/plans/vreau-ca-tot-textul-greedy-rivest.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 14:24:15 +00:00
4be70440e8 feat(voice): DAVE E2E + full voice UX (squash of voice/dave-recv)
Squashed branch: voice/dave-recv → master. Closes Pas 12 (DAVE E2E) and lands
voice-mode UX polish + verbal voice control on top of the Pas 1-10 scaffolding
already on master.

## DAVE E2E receive-side decrypt (e4f3177)

Vendored fork: discord-ext-voice-recv 0.5.3a+echo.dave1. Patches the receive
pipeline to handle Discord's mandatory DAVE encryption on voice gateway v=8.
- `_maybe_dave_decrypt`: uses davey.can_passthrough(user_id) as primary gate,
  falls through to dave.decrypt for DAVE-epoch peers, drops on decrypt failure
  without killing the reader thread.
- VAD fix: silero-vad v5+ requires exactly 512 samples; our 100ms window
  (1600 samples) was silently raising ValueError → STT never fired. Now slice
  into 512-sample chunks.
- Whisper: bumped beam_size 1→5 and added RO initial_prompt.
- Tests: 11 DAVE unit tests + 2 callback integration tests + contract test
  with fork-version guard.
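
The VAD fix can be sketched as follows. The commit message only says the window is sliced into 512-sample chunks; carrying the 64-sample remainder into the next window is an assumption of this sketch, and `split_for_vad` is a hypothetical name:

```python
VAD_CHUNK_SAMPLES = 512  # chunk size the commit message cites for silero-vad v5+


def split_for_vad(samples, carry=()):
    """Split a window of 16 kHz samples into exact 512-sample chunks.

    Returns (chunks, leftover). `leftover` holds the samples that did not
    fill a whole chunk; pass it back as `carry` with the next window.
    """
    buf = list(carry) + list(samples)
    n_full = len(buf) // VAD_CHUNK_SAMPLES
    chunks = [
        buf[i * VAD_CHUNK_SAMPLES:(i + 1) * VAD_CHUNK_SAMPLES]
        for i in range(n_full)
    ]
    leftover = buf[n_full * VAD_CHUNK_SAMPLES:]
    return chunks, leftover
```

A 100ms window of 1600 samples gives three full chunks plus 64 samples to carry forward.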

## Voice UX polish (d1bc77e)

- Removed the 3s "mă gândesc" ("I'm thinking") filler (always collided with Claude p50 4-7s).
- Barge-in via `ttsq.clear()` at top of `on_segment_done`.
- DTX silence-flush poller (200ms tick) — Discord stops sending RTP packets
  when silent, so the inline silence-check in sink.write() never fired for
  trailing audio; background thread handles it.
- `EchoStreamingAudioSource.read()` non-blocking — old `get_frame(timeout=0.1)`
  wrecked Discord's 20ms cadence and the client interpreted bursts as
  stuttering (Marius heard "4 de minute" instead of full sentence).
- RO time expansion: 23:09 → "douăzeci și trei și nouă minute".
- Supertonic Unicode sanitize centralized in tools/tts.py.
- Whisper local_files_only=True — no HF metadata GET on each startup.
- Diagnostic logging through sink → VAD → Claude stream → TTS chain.
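
The non-blocking `read()` idea can be sketched as below. The class is a stand-in that does not subclass `discord.AudioSource`, and returning a silence frame when the queue is empty is an assumption (in discord.py, returning empty bytes from `read()` signals end of stream, so a long-lived source has to return something else):

```python
import queue

FRAME_BYTES = 3840  # 20 ms of 48 kHz s16le stereo
SILENCE_FRAME = b"\x00" * FRAME_BYTES


class NonBlockingFrameSource:
    """Hypothetical stand-in for a streaming audio source."""

    def __init__(self):
        self._frames = queue.Queue()

    def push(self, frame: bytes) -> None:
        self._frames.put(frame)

    def read(self) -> bytes:
        # Never wait on the queue: the caller invokes read() on a fixed
        # 20 ms cadence, and blocking here would delay the next frame.
        try:
            return self._frames.get_nowait()
        except queue.Empty:
            return SILENCE_FRAME
```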

## Voice mode iteration (e589e48)

- `personality/VOICE_MODE.md` — voice-tailored system prompt (short, no
  markdown, no abbreviations, time without seconds, distances in
  "mii"/"milioane"); plumbed via build_system_prompt(voice_mode=True).
- Isolated voice session key `voice:<channel_id>` — voice doesn't share
  context with text adapter on the same channel; auto-applied without
  /clear ceremony. /clear drops both keys.
- Metric units + Romanian thousands (normalize.py): "384.000 km" →
  "trei sute optzeci și patru de mii de kilometri" with feminine-correct
  pluralization and "de" particle for ≥20.
- `/voice setvoice <M1-F5>` slash command with native autocomplete; swaps
  live + persists voice.default_voice to config.json.
- Verbal voice change (src/voice/voice_commands.py + 29 tests) — "schimbă
  vocea pe M5", "voce em cinci", with permissive substring fallback for
  Whisper-mangled forms like "Mâcinci"=M5 and "unul cinci"=M5. Whisper
  initial_prompt now lists voice vocabulary to bias STT toward clean
  outputs.
- Fast barge-in: VAD ≥2 consecutive speech windows (~200ms) from Marius's user
  while Echo has pending TTS frames → cut Echo off mid-sentence so the user
  doesn't wait for the full silence + STT cycle. Acoustic echo bleed-through
  still requires headphones (no AEC).
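
The fast barge-in rule amounts to a small streak counter. This is a sketch under the assumption that each VAD window produces one boolean; `BargeInDetector` and its method names are hypothetical:

```python
class BargeInDetector:
    """Fire once the user has spoken for N consecutive VAD windows
    while TTS frames are still pending."""

    def __init__(self, min_windows: int = 2):
        self.min_windows = min_windows
        self._streak = 0

    def update(self, is_speech: bool, tts_pending: bool) -> bool:
        """Feed one VAD window; return True when playback should be cut."""
        if not (is_speech and tts_pending):
            self._streak = 0  # silence or nothing playing: reset
            return False
        self._streak += 1
        if self._streak >= self.min_windows:
            self._streak = 0
            return True
        return False
```

With 100ms windows and `min_windows=2`, the cut happens about 200ms after the user starts talking.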

## Test suite

130 voice + router tests pass (test_voice_recv_dave, test_voice_session_cleanup,
test_voice_adapter_contract, test_voice_normalize, test_voice_commands,
test_router).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:00:27 +00:00
23666f7910 feat(voice): Pas 5 — voice/pipeline.py VoiceSession + EchoVoiceSink + cleanup
Central voice pipeline (~250 LOC + docstrings = ~430 lines):

VoiceSession (context manager + idempotent cleanup across 5 paths):
- __enter__: acquire _lock, open JSONL (record=on)
- __exit__: calls cleanup("exit"), does not suppress exceptions
- cleanup(reason): IDEMPOTENT, side effects run only once: JSONL
  flush+close (record=on) or delete (record=off), bot presence cleared,
  voice_client.cleanup(), ttsq.stop(), cancel filler task, lock release,
  structured log to logs/voice_metrics.jsonl
- on_segment_done(speaker_id, text, no_speech_prob): mirror text channel,
  append JSONL, arm 3s filler timer, route_message with on_text callback
  + cancel filler on first block
- last_activity_ts: time.monotonic() — caller-driven 5min auto-leave
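
The context-manager and idempotent-cleanup contract can be illustrated with a stripped-down stand-in. `SessionSketch` is hypothetical and only models the lock and the run-once guarantee, not the JSONL, presence, or TTS side effects:

```python
import threading


class SessionSketch:
    """Hypothetical stand-in showing run-once cleanup."""

    def __init__(self):
        self._lock = threading.Lock()           # session mutex
        self._cleanup_guard = threading.Lock()  # protects the flag below
        self._cleaned = False
        self.cleanup_reasons = []

    def __enter__(self):
        self._lock.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.cleanup("exit")
        return False  # never suppress the caller's exception

    def cleanup(self, reason: str) -> bool:
        """Run side effects once; later calls are no-ops returning False."""
        with self._cleanup_guard:
            if self._cleaned:
                return False
            self._cleaned = True
        self.cleanup_reasons.append(reason)
        if self._lock.locked():
            self._lock.release()
        return True
```

Whichever path calls `cleanup()` first wins, and the `__exit__` call that follows becomes a no-op.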

EchoVoiceSink(session, bot_user_id):
- wants_opus() False (PCM)
- write() runs in the voice_recv reader thread (threading primitives only):
  - GUARD 1: user None/id==0/id==bot_user_id → return (load-bearing
    echo prevention)
  - GUARD 2: whitelist filter (empty = allow all)
  - Buffer 20ms packets per-user → batch 100ms (5×20ms = 19200 bytes)
    → silero-vad threshold 0.5 → 800ms cumulative silence flush
  - _flush_to_stt: faster-whisper small int8 cpu_threads=4 lang=ro
    beam_size=1, no_speech_prob > 0.6 drop, schedule on_segment_done
    via run_coroutine_threadsafe on session.loop
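
A minimal sketch of the two guards and the thread-to-loop handoff described above. The helper names are hypothetical; `asyncio.run_coroutine_threadsafe` is the standard-library call for scheduling a coroutine from a non-asyncio thread:

```python
import asyncio


def should_process(user_id, bot_user_id, whitelist) -> bool:
    """GUARD 1 + GUARD 2 as plain predicates."""
    # GUARD 1: unknown user, id 0, or the bot itself (echo prevention).
    if user_id is None or user_id == 0 or user_id == bot_user_id:
        return False
    # GUARD 2: an empty whitelist allows everyone.
    if whitelist and user_id not in whitelist:
        return False
    return True


def schedule_from_thread(loop, coro_fn, *args):
    """Schedule coro_fn(*args) on `loop` from a non-asyncio thread.

    Returns a concurrent.futures.Future that the calling thread may
    wait on or ignore.
    """
    return asyncio.run_coroutine_threadsafe(coro_fn(*args), loop)
```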

Module helpers (lazy thread-safe singletons): _get_whisper_model,
_get_silero_vad. Constants: FILLER_DELAY_S=3.0, SILENCE_FLUSH_MS=800,
VAD_THRESHOLD=0.5, VAD_WINDOW_MS=100, NO_SPEECH_DROP_THRESHOLD=0.6.

Decisions:
- STT runs in audio thread: acceptable at 2.25s p50 (user just stopped
  talking, no batching contention). Wrap in ThreadPoolExecutor.submit
  if perf bites later.
- Downsample 48k→16k via 3-sample averaging (no scipy dep). Whisper is
  robust to mild aliasing.
- Energy-RMS VAD fallback if the torch import fails (graceful degradation).
- router_route_message injection seam as a kwarg for testability.
- bot.change_presence handling cross-thread via run_coroutine_threadsafe.
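
The 3-sample averaging decision can be sketched in pure Python. This assumes mono integer PCM samples (any stereo-to-mono mixing happens elsewhere), and the function name is hypothetical:

```python
def downsample_48k_to_16k(samples):
    """Average each group of 3 consecutive samples (48 kHz -> 16 kHz).

    Trailing samples that do not fill a group of 3 are dropped. Integer
    floor division keeps the output as ints.
    """
    usable = len(samples) - (len(samples) % 3)
    return [
        (samples[i] + samples[i + 1] + samples[i + 2]) // 3
        for i in range(0, usable, 3)
    ]
```

Averaging acts as a crude low-pass filter before decimation, which is why only mild aliasing is expected.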

tests/test_voice_session_cleanup.py — 6 tests:
- voice_leave / disconnect / crash via __exit__ / auto_leave /
  user_left_channel (5 cleanup paths each verified for: JSONL state,
  presence cleared, voice_client.cleanup, ttsq.stop, lock release,
  idempotency)
- 1 robustness cross-cut (double-cleanup safety)

6/6 PASS. Regression suite 63/63 PASS (normalize + adapter + mutex).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:55:57 +00:00
217da65417 feat(voice): Pas 6 — voice/tts_stream.py streaming TTS pipeline
src/voice/tts_stream.py (~280 lines):
- clause_segments(text, min_words=8): yield Romanian-aware clause chunks.
  Split at punctuation (./!/?;:,), accumulating until min_words is satisfied;
  edge case text < min_words → single chunk. Does NOT split mid-word/number/
  currency. Romanian phrase intonation breaks at sentence boundaries, so 8+
  words minimizes seams.
- TTSQueue worker thread: text queue in → PCM frames out. Methods:
  start/stop/push_text/push_filler/clear/is_empty. normalize_for_tts()
  apply first, then clause_segments(), then Supertonic synth per chunk.
- EchoStreamingAudioSource(discord.AudioSource): read() pull from PCM
  queue, 20ms frames (3840 bytes 48kHz s16le stereo). Eliminates RTP
  gap between play() calls — single play() per session, source pulls.
- load_thinking_wav(): one-shot cache → 140 × 20ms frames (~2.8s)
  for the filler "Stai puțin să-mi adun gândurile" ("Give me a moment to
  gather my thoughts").
- wav_to_pcm_20ms_frames(): WAV parser + ffmpeg subprocess resample
  to 48kHz s16le stereo if needed.
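
A minimal sketch of the clause-chunking idea. It assumes a split is allowed only where punctuation is followed by whitespace (which keeps "3.14" and "12,50" intact) and that a short trailing remainder is emitted as its own chunk; the real clause_segments may differ on both points:

```python
import re

# Split only after punctuation that is followed by whitespace, so
# "3.14" or "12,50" are never broken in the middle.
_BREAK_RE = re.compile(r"(?<=[.!?;:,])\s+")


def clause_segments(text: str, min_words: int = 8):
    """Yield chunks that end at punctuation and hold >= min_words words.

    Text shorter than min_words comes back as a single chunk, and the
    final chunk may also be shorter than min_words.
    """
    pieces = [p for p in _BREAK_RE.split(text.strip()) if p]
    buf, count = [], 0
    for piece in pieces:
        buf.append(piece)
        count += len(piece.split())
        if count >= min_words:
            yield " ".join(buf)
            buf, count = [], 0
    if buf:
        yield " ".join(buf)
```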

Smoke test (in session): clause_segments behaviour OK, thinking.wav
loads, TTSQueue + EchoStreamingAudioSource construct clean. Integration
testing deferred to convergence (Pas 7 + Pas 11).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:44:13 +00:00
a3eefbc799 feat(voice): Pas 4 — _discord_voice_adapter.py thin layer + contract test
Adapter layer over the vendored discord-ext-voice-recv. Re-exports:
VoiceReceiveClient, AudioSink, VoiceData, plus async helper
connect_voice(channel). The Discord voice protocol is fragile and upstream is
a hobby fork; if it breaks, swapping to py-cord means rewriting only this file.

Contract test (22 assertions) catches drift on vendor upgrades:
- VoiceReceiveClient methods: connect/disconnect/listen/stop_listening/
  is_listening/stop/cleanup
- listen(sink, *, after=None) signature
- sink property (read/write)
- AudioSink methods: write/cleanup/wants_opus + write(self, user, data) arity
- VoiceData slots (packet/source/pcm) + .opus property
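
A contract check of this kind can be written with `inspect`. The sketch below runs against a local stand-in class rather than the vendored library, so both `AudioSinkStandIn` and `check_sink_contract` are hypothetical names:

```python
import inspect


class AudioSinkStandIn:
    """Stand-in with the sink surface listed above."""

    def write(self, user, data):
        pass

    def cleanup(self):
        pass

    def wants_opus(self):
        return False


def check_sink_contract(cls):
    """Return a list of human-readable contract violations (empty = OK)."""
    problems = []
    for name in ("write", "cleanup", "wants_opus"):
        if not callable(getattr(cls, name, None)):
            problems.append(f"missing method: {name}")
    write = getattr(cls, "write", None)
    if callable(write):
        params = list(inspect.signature(write).parameters)
        if params != ["self", "user", "data"]:
            problems.append(f"unexpected write() signature: {params}")
    return problems
```

Running such a check against the real re-exports after each vendor bump surfaces renamed methods or changed arity before the pipeline hits them at runtime.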

Critical for Lane PIPE downstream: write() is called from the audio thread
(NOT the asyncio loop), so threading primitives are mandatory for EchoVoiceSink.

22/22 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:42:50 +00:00
a48562b2f5 feat(voice): Pas 3 — voice/normalize.py + 35 RO test cases
Pure functions for TTS text normalization (RO):
- strip_markdown: regex bold/italic/code/link/heading/list
- expand_numbers_ro: num2words for cardinals + decimal handling
  ("3.14" → "trei virgulă paisprezece", "3.05" → "trei virgulă zero
  cinci", digit-by-digit after a leading zero)
- expand_currency: natural RO form ("12.50 RON" → "doisprezece lei
  și cincizeci de bani", "$25.99" → "douăzeci și cinci de dolari și
  nouăzeci și nouă de cenți")
- expand_symbols: %/&/@/° + whitespace collapse
- expand_abbreviations: etc./dl./dna./nr./ş.a./ş.a.m.d.
- normalize_for_tts: full pipeline + hard truncate at 200 words with
  "Restul l-am scris în chat." ("I wrote the rest in chat.")

Pipeline order: markdown → abbreviations → currency → numbers →
symbols → truncate. Currency runs BEFORE numbers; otherwise "12.50 RON"
degrades to "doisprezece virgulă cincizeci RON". Romanian "de"
particle rule: n>=20 AND (n%100 not in 1..19) → "o sută de lei",
but "o sută cinci lei" (no "de"). n=1 with currency → "un dolar" /
"un leu" (article, not cardinal).

35/35 tests pass: markdown(5), cardinals(6), decimals(4), currency
RON/USD/EUR/GBP mix(8), symbols(4), abbreviations(4), truncation(2),
edge cases empty/whitespace(2).
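
The "de" particle rule stated in this commit translates directly into a predicate. The function name is hypothetical; the condition is the one given in the message:

```python
def needs_de(n: int) -> bool:
    """True when Romanian inserts "de" between the numeral and the noun.

    Rule from the commit message: n >= 20 AND (n % 100 not in 1..19).
    """
    return n >= 20 and (n % 100) not in range(1, 20)
```

So 100 takes the particle ("o sută de lei") while 105 does not ("o sută cinci lei").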

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:42:41 +00:00
af5af8133f feat(voice): Pas 2 — install voice deps, vendor discord-ext-voice-recv, setup assets
Foundation for the Discord voice-to-voice pipeline.

- requirements.txt: faster-whisper, silero-vad, num2words, numpy, PyNaCl
- vendor/discord-ext-voice-recv/: vendored at commit ac04ea7b09 (bump version
  0.5.3a). The Discord voice protocol is fragile and upstream is a hobby fork.
  The adapter layer in src/voice/_discord_voice_adapter.py isolates churn
  (swapping to py-cord = rewriting only that file). VENDOR_INFO.md documents
  the update procedure.
- tools/voice_setup.py: idempotent setup script — libopus check, ffmpeg
  check, Supertonic reachable, faster-whisper/silero-vad warm, assets
  generation. Exit 0 = green, 1 = needs human (currently libopus is missing;
  fix with `sudo apt install -y libopus0`).
- assets/voice/: thinking.wav (filler "Stai puțin să-mi adun gândurile",
  ~2.8s), mhm.wav (listener noise), beep_200ms.wav (wake-up tone 880Hz).
- src/voice/__init__.py: package stub.
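
An idempotent dependency check in the spirit of tools/voice_setup.py might look like the sketch below. It covers only the ffmpeg and libopus probes, using standard-library lookups; the function names and the way failures are reported are assumptions:

```python
import ctypes.util
import shutil
import sys


def check_ffmpeg() -> bool:
    """True if an ffmpeg executable is on PATH."""
    return shutil.which("ffmpeg") is not None


def check_libopus() -> bool:
    """True if the system linker can locate libopus."""
    return ctypes.util.find_library("opus") is not None


def run_checks(checks) -> int:
    """Run named checks; return 0 if all pass, 1 if any needs a human."""
    failed = [name for name, check in checks.items() if not check()]
    for name in failed:
        print(f"needs human: {name}", file=sys.stderr)
    return 1 if failed else 0
```

Because the checks only read system state, rerunning the script is safe, and the return value maps onto the exit codes described above (for example `sys.exit(run_checks({"ffmpeg": check_ffmpeg, "libopus": check_libopus}))`).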

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:42:27 +00:00