NLP Master: pipeline download + transcribe + summarize

- run.bat: one-click pipeline (download, convert, transcribe)
- download.py: fetch audio from course platform
- transcribe.py: whisper.cpp batch transcription (CPU, WAV 16kHz)
  - MP3->WAV conversion via ffmpeg
  - --modules filter for splitting work across machines
- summarize.py: generate summaries from transcripts
- setup_whisper.py: auto-download whisper.cpp, ffmpeg, and model
- Medium model (q5_0) instead of large to avoid VRAM crashes
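
A minimal sketch of how a `--modules` filter could split the batch across machines. The commit does not show the flag's syntax, so the comma-separated format, the `audio/<module>/<file>.mp3` layout, and the helper names `parse_modules`, `select_files`, and `build_parser` are all assumptions for illustration, not the actual code in transcribe.py.

```python
import argparse
from pathlib import Path


def parse_modules(value):
    """Split a comma-separated --modules value into a set of module names."""
    return {part.strip() for part in value.split(",") if part.strip()}


def select_files(paths, modules):
    """Keep only paths whose parent directory name is in `modules`.

    An empty set means no filter was given, so every path is kept.
    Assumes a hypothetical layout of audio/<module>/<file>.mp3.
    """
    if not modules:
        return list(paths)
    return [p for p in paths if Path(p).parent.name in modules]


def build_parser():
    """Build a parser with a hypothetical --modules option."""
    parser = argparse.ArgumentParser(description="Batch transcription (sketch)")
    parser.add_argument(
        "--modules",
        type=parse_modules,
        default=set(),
        help="comma-separated module names to process on this machine",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    all_files = sorted(str(p) for p in Path("audio").rglob("*.mp3"))
    for path in select_files(all_files, args.modules):
        print(path)
```

Under these assumptions, one machine could run with `--modules m1,m2` and another with `--modules m3,m4`, so that no file is transcribed twice.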

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 01:37:13 +02:00
commit bbc5884545
10 changed files with 2203 additions and 0 deletions

.gitignore

@@ -0,0 +1,34 @@
# Audio files
audio/
*.mp3
*.wav

# Whisper models
models/
*.bin

# Credentials
.env

# Transcripts and summaries (large generated content)
transcripts/
summaries/

# Binaries (downloaded by setup_whisper.py)
whisper-bin/
ffmpeg-bin/

# Temp files
.whisper_bin_path
.ffmpeg_bin_path

# WAV cache (converted from MP3)
audio_wav/

# Python
__pycache__/
*.pyc
.venv/

# Logs
*.log
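
The ignored `audio_wav/` directory holds WAV files converted from MP3 for whisper.cpp, which expects 16 kHz 16-bit PCM WAV input. The following is a rough sketch of such a conversion cache, not the code from this commit: the helper names `wav_path_for`, `ffmpeg_command`, and `convert_if_missing` are hypothetical, mono output is an assumption, and `ffmpeg` is assumed to be on the PATH unless another executable is passed in.

```python
import subprocess
from pathlib import Path


def wav_path_for(mp3_path, audio_dir="audio", wav_dir="audio_wav"):
    """Map audio/<sub>/<name>.mp3 to audio_wav/<sub>/<name>.wav."""
    rel = Path(mp3_path).relative_to(audio_dir)
    return (Path(wav_dir) / rel).with_suffix(".wav")


def ffmpeg_command(src, dst, ffmpeg="ffmpeg"):
    """Build the ffmpeg call for 16 kHz, mono, 16-bit PCM WAV output."""
    return [
        ffmpeg,
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # input file
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono (assumption)
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]


def convert_if_missing(mp3_path, ffmpeg="ffmpeg"):
    """Convert one MP3 into the WAV cache, skipping files already converted."""
    dst = wav_path_for(mp3_path)
    if dst.exists():
        return dst
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(ffmpeg_command(mp3_path, dst, ffmpeg), check=True)
    return dst
```

Caching the converted files this way means a rerun only pays the conversion cost for newly downloaded audio, and keeping the cache out of version control avoids committing large generated files.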