Design: NLP Master Course Audio Pipeline
Generated by /office-hours on 2026-03-23 Branch: unknown Repo: nlp-master (local, no git) Status: APPROVED Mode: Builder
Problem Statement
Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.
What Makes This Cool
58 hours of Romanian lecture audio turned into searchable, summarized study materials — completely automated. Download once, transcribe overnight, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.
Constraints
- Hardware: AMD Radeon RX 6600M 8GB (RDNA2) — no CUDA, needs Vulkan or ROCm
- Language: Romanian audio — Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- Source: Password-protected website at cursuri.aresens.ro/curs/26
- Scale: ~35 MP3 files, ~95 min each, ~58 hours total
- Privacy: Course content is for personal study use only
- Tooling: Claude Code available for summary generation (no separate API cost)
- Platform: Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- Summaries language: Romanian (matching source material)
- Audio format: MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")
Premises
- Legitimate access to the course — downloading audio for personal study is within usage rights
- whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
- Audio quality is decent (recorded lectures) — Whisper large-v3 will produce usable Romanian transcripts
- Summaries will be generated by Claude Code after transcription — separate step
- Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing
Approaches Considered
Approach A: Full Pipeline (CHOSEN)
Python script for website login + MP3 download. Shell script for whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.
- Effort: M (human: ~2 days / CC: ~30 min to build, ~8 hours to run transcription)
- Risk: Low
- Pros: Complete automation, reproducible for module 6, best quality
- Cons: whisper.cpp Vulkan build requires system setup
Approach B: Download + Transcribe Only
Same download + transcription, no automated summaries. Simpler but defers the valuable part.
- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low
Approach C: Fully Offline (Local LLM summaries)
Everything offline including summaries via llama.cpp. Zero external costs but lower summary quality.
- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)
Recommended Approach
Approach A: Full Pipeline — Download → whisper.cpp/Vulkan → Claude Code summaries.
Execution model: Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.
Step 0: Project Setup
- Initialize git repo with `.gitignore` (exclude: audio/, models/, .env, *.mp3, *.wav, *.bin)
- Install Python on Windows (if not already)
- Install Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)
Step 1: Site Recon + Download Audio Files
- First: Browse cursuri.aresens.ro/curs/26 to understand page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests + BeautifulSoup for static pages, Playwright for JS-rendered — don't build both)
- Login with credentials from `.env` or interactive prompt
- Discover all modules dynamically (don't hardcode 5x7 — actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- Resumability: skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- Validation: after download completes, print a summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
Step 2: Install whisper.cpp with Vulkan (Windows native)
- Option A: Download pre-built Windows binary with Vulkan from whisper.cpp-windows-vulkan-bin
- Option B: Build from source with Visual Studio + the `-DGGML_VULKAN=1` CMake flag
- Download model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- VRAM test: transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and validate MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- Speed calibration: RX 6600M is roughly half the speed of RX 9070 XT. Realistic estimate: 3-5x realtime (~18-30 min per 90-min file). Total: ~12-18 hours for all files. Plan for a full day, not overnight.
- Fallback: if large-v3-q5_0 OOMs on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
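The speed calibration above is simple arithmetic, worth writing down once so the 2-min test clip's measured realtime factor can be plugged straight in. A minimal sketch; `transcription_eta_hours` is a hypothetical helper, and the factors are the plan's own estimates:

```python
def transcription_eta_hours(num_files: int, minutes_per_file: float,
                            realtime_factor: float) -> float:
    """Estimate wall-clock hours for the whole batch.

    realtime_factor = minutes of audio processed per minute of wall time
    (e.g. 4.0 means 4x realtime, mid-range of the plan's 3-5x estimate).
    """
    total_audio_min = num_files * minutes_per_file
    return total_audio_min / realtime_factor / 60.0
```

At 3x realtime the 35-file batch lands near 18.5 hours and at 5x near 11, which is where the plan's 12-18 hour window comes from.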
Step 3: Batch Transcription
- `transcribe.py` (Python, cross-platform) reads `manifest.json`, processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Output .txt and .srt per file to a flat `transcripts/` directory (consistent with the manifest schema and directory structure)
- Updates `manifest.json` with transcription status per file
- Resumability: skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- Quality gate: after the first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider switching to unquantized `large-v3`, adjusting `--beam-size`, or accepting lower quality.
- Validation: print a summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
Step 4: Summary Generation with Claude Code
- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante" ("Summarize this transcript. Provide: (1) a general overview in 3-5 sentences, (2) key concepts with definitions, (3) important details and examples")
- Chunking: split transcripts longer than ~10K words at sentence boundaries (not at a raw word-count cutoff) with ~500-word overlap. Summarize chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md` — master study guide with lecture titles as headings
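The chunking rule above (sentence boundaries, ~500-word overlap) can be sketched as follows. This is a minimal illustration with a naive regex sentence splitter, adequate for lecture transcripts but not a real tokenizer; `chunk_transcript` is a hypothetical helper for `summarize.py`.

```python
import re

def chunk_transcript(text: str, max_words: int = 10_000,
                     overlap_words: int = 500) -> list[str]:
    """Split a transcript into chunks of at most max_words, cutting only
    at sentence boundaries, and repeat roughly overlap_words of trailing
    context at the start of the next chunk so no idea is cut mid-thought."""
    # Naive split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Carry whole sentences from the tail until ~overlap_words.
            tail, tail_count = [], 0
            for prev in reversed(current):
                tail_count += len(prev.split())
                tail.insert(0, prev)
                if tail_count >= overlap_words:
                    break
            current, count = tail, tail_count
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A 95-minute lecture at typical speaking pace runs roughly 12-15K words, so most transcripts will produce two chunks.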
Manifest Schema
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
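Since all three scripts share this schema, a common helper to flatten the nested structure and filter by stage keeps the resumability logic in one place. A sketch assuming the field names above; `pending_lectures` is a hypothetical name:

```python
def pending_lectures(manifest: dict, stage: str) -> list[dict]:
    """Flatten modules -> lectures and return entries whose status for the
    given stage ('download' or 'transcribe') is not 'complete'."""
    key = f"{stage}_status"
    return [
        lec
        for module in manifest["modules"]
        for lec in module["lectures"]
        if lec.get(key) != "complete"
    ]
```

`download.py` and `transcribe.py` would each call this at startup, which is what makes the whole pipeline safely re-runnable when module 6 arrives.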
Directory Structure
nlp-master/
.gitignore # Excludes audio/, models/, .env
.env # Course credentials (not committed)
manifest.json # Shared metadata for all scripts
download.py # Step 1: site recon + download
transcribe.py # Step 3: batch transcription
summarize.py # Step 4: summary generation helper
audio/
Master 25M1 Z1A [Audio].mp3
Master 25M1 Z1B [Audio].mp3
...
models/
ggml-large-v3-q5_0.bin
transcripts/
Master 25M1 Z1A.txt
Master 25M1 Z1A.srt
...
summaries/
Master 25M1 Z1A_summary.md
...
SUPORT_CURS.md
Open Questions
- What is the exact website structure? Resolved: browse site first in Step 1.
- Are there lecture titles on the website? Resolved: preserve original names + extract titles.
- Do you want the summaries in Romanian or English? Resolved: Romanian.
- Should the master study guide (SUPORT_CURS.md) include the full transcripts or just summaries?
- Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
- Does whisper.cpp Windows binary support MP3 input natively? (Validated in Step 2 VRAM test)
Success Criteria
- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives
Next Steps
- git init + .gitignore — set up project, exclude audio/models/.env (~2 min)
- Browse cursuri.aresens.ro — understand site structure before coding (~10 min)
- Build download.py — login + scrape + download + manifest.json (~15 min with CC)
- Install whisper.cpp on Windows — pre-built binary or build from source + Vulkan SDK (~15 min)
- Download whisper model — large-v3-q5_0 from Hugging Face (~5 min)
- Test transcription — 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
- Build transcribe.py — reads manifest, processes in module order, updates status (~10 min with CC)
- Run batch transcription — ~12-18 hours (leave running during workday)
- Spot-check quality — review 2-3 transcripts after Module 1 completes
- Generate summaries with Claude Code — via summarize.py helper (~30 min)
- Compile SUPORT_CURS.md — master study guide (~10 min)
NOT in scope
- Building a web UI or search interface for transcripts — just flat files
- Automated quality scoring of transcriptions — manual spot-check is sufficient
- Speaker diarization (identifying different speakers) — single lecturer
- Translation to English — summaries stay in Romanian
- CI/CD or deployment — this is a local personal pipeline
What already exists
- Nothing — greenfield project. No existing code to reuse.
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs.
Failure Modes
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|====================
Session expires during download | No    | Yes (retry)    | No — logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No — validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No — logged
whisper.cpp can't read MP3      | No    | No**           | Yes — CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No — validation
Poor Romanian accuracy          | No    | Yes (gate)     | No — spot-check
Claude Code input too large     | No    | Yes (chunk)    | No — script handles
manifest.json corruption        | No    | No             | Yes — low risk
* = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
Critical gap: whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.
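The two silent failure modes above (unreadable MP3 producing garbage, empty transcript) can both be caught by one cheap check run right after each transcription. A sketch; `transcript_looks_sane` and its thresholds are hypothetical, tuned by eye rather than measured:

```python
def transcript_looks_sane(text: str, min_chars: int = 200,
                          min_alpha_ratio: float = 0.6) -> bool:
    """Cheap sanity check on a finished transcript: long enough, and
    mostly letters/whitespace. A decoder fed unreadable audio tends to
    emit nothing, punctuation runs, or repeated symbols, all of which
    fail the alphabetic-ratio test."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return alpha / len(stripped) >= min_alpha_ratio
```

Running this inside `transcribe.py` after each file turns both "Yes — CRITICAL" rows into logged, non-silent failures, and makes the Step 2 MP3 validation a hard gate rather than a visual check.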
Eng Review Decisions (2026-03-24)
- Hybrid platform → All Windows Python (not WSL2 for scripts)
- Browse site first → build the right scraper, not two fallback paths
- Preserve original file names + extract lecture titles
- Add manifest.json as shared metadata between scripts
- Python for all scripts (download.py, transcribe.py, summarize.py)
- Built-in validation checks in each script
- Feed MP3s directly (no pre-convert)
- Process in module order
- Realistic transcription estimate: 12-18 hours (not 7-8)
What I noticed about how you think
- You said "vreau offline transcription + claude code pentru summaries" ("I want offline transcription + Claude Code for summaries") — you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 module din 6, fiecare cu 7 audio-uri" ("5 of 6 modules, each with 7 audio files") and "90-100 minute" (90-100 minutes) — you'd already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system" — you know what's useful without over-engineering.
GSTACK REVIEW REPORT
| Review | Trigger | Why | Runs | Status | Findings |
|---|---|---|---|---|---|
| CEO Review | /plan-ceo-review | Scope & strategy | 0 | — | — |
| Codex Review | /codex review | Independent 2nd opinion | 0 | — | — |
| Eng Review | /plan-eng-review | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | /plan-design-review | UI/UX gaps | 0 | — | — |
- OUTSIDE VOICE: Claude subagent ran — 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- UNRESOLVED: 0
- VERDICT: ENG CLEARED — ready to implement