# Design: NLP Master Course Audio Pipeline

Generated by /office-hours on 2026-03-23 | Branch: unknown | Repo: nlp-master (local, no git) | Status: APPROVED | Mode: Builder

## Problem Statement

Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.

## What Makes This Cool

58 hours of Romanian lecture audio turned into searchable, summarized study materials — completely automated. Download once, transcribe overnight, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.

## Constraints

- **Hardware:** AMD Radeon RX 6600M 8GB (RDNA2) — no CUDA, needs Vulkan or ROCm
- **Language:** Romanian audio — Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- **Source:** Password-protected website at cursuri.aresens.ro/curs/26
- **Scale:** ~35 MP3 files, ~95 min each, ~58 hours total
- **Privacy:** Course content is for personal study use only
- **Tooling:** Claude Code available for summary generation (no separate API cost)
- **Platform:** Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- **Summaries language:** Romanian (matching source material)
- **Audio format:** MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")

## Premises

1. Legitimate access to the course — downloading audio for personal study is within usage rights
2. whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
3. Audio quality is decent (recorded lectures) — Whisper large-v3 will produce usable Romanian transcripts
4. Summaries will be generated by Claude Code after transcription — separate step
5. Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing

## Approaches Considered

### Approach A: Full Pipeline (CHOSEN)

Python script for website login + MP3 download. Python script driving whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.

- Effort: M (human: ~2 days / CC: ~30 min to build, ~12-18 hours to run transcription)
- Risk: Low
- Pros: complete automation, reproducible for module 6, best quality
- Cons: the whisper.cpp Vulkan build requires system setup

### Approach B: Download + Transcribe Only

Same download + transcription, but no automated summaries. Simpler, but defers the valuable part.

- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low

### Approach C: Fully Offline (Local LLM Summaries)

Everything offline, including summaries via llama.cpp. Zero external costs but lower summary quality.

- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)

## Recommended Approach

**Approach A: Full Pipeline** — Download → whisper.cpp/Vulkan → Claude Code summaries.

**Execution model:** Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.

### Step 0: Project Setup

- Initialize a git repo with `.gitignore` (exclude: `audio/`, `models/`, `.env`, `*.mp3`, `*.wav`, `*.bin`)
- Install Python on Windows (if not already installed)
- Install the Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)

### Step 1: Site Recon + Download Audio Files

- **First:** browse cursuri.aresens.ro/curs/26 to understand the page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests + BeautifulSoup for static pages, Playwright for JS-rendered ones — don't build both)
- Log in with credentials from `.env` or an interactive prompt
- Discover all modules dynamically (don't hardcode 5x7 — the actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- **Resumability:** skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- **Validation:** after the download completes, print a summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
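
The resumability and retry bullets above can be sketched as follows. `fetch` stands in for whatever downloader the site recon dictates (requests or Playwright), and the 1 MB threshold mirrors the validation check; both names are illustrative, not final:

```python
import time
from pathlib import Path

def needs_download(path: Path, min_bytes: int = 1_000_000) -> bool:
    """Resumability check: skip files that exist and clear the > 1 MB validation bar."""
    return not (path.exists() and path.stat().st_size >= min_bytes)

def download_with_retry(fetch, dest: Path, retries: int = 3, backoff: float = 2.0) -> bool:
    """Call `fetch(dest)` up to `retries` times with exponential backoff.

    `fetch` is any callable that writes the file and raises on failure;
    the caller logs to download_errors.log when this returns False.
    """
    for attempt in range(retries):
        try:
            fetch(dest)
            return True
        except Exception:
            time.sleep(backoff * 2 ** attempt)  # 2s, 4s, 8s with the defaults
    return False
```

`download.py` would loop over the manifest, call `needs_download` first, and record the boolean result as the per-file download status.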
### Step 2: Install whisper.cpp with Vulkan (Windows native)

- Option A: download a pre-built Windows binary with Vulkan support from [whisper.cpp-windows-vulkan-bin](https://github.com/jerryshell/whisper.cpp-windows-vulkan-bin)
- Option B: build from source with Visual Studio and the `-DGGML_VULKAN=1` CMake flag
- Download the model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- **VRAM test:** transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and confirm MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- **Speed calibration:** the RX 6600M is roughly half the speed of an RX 9070 XT. Realistic estimate: **3-5x realtime** (~18-30 min per 90-min file). Total: **~12-18 hours** for all files. Plan for a full day, not overnight.
- **Fallback:** if large-v3-q5_0 runs out of VRAM on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
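
If the VRAM test shows the build cannot decode MP3, the pre-convert fallback is one ffmpeg call per file. A minimal sketch of the command builder (assumes ffmpeg is on PATH; pass the list to `subprocess.run`):

```python
from pathlib import Path

def ffmpeg_to_wav_cmd(mp3: Path, wav: Path) -> list[str]:
    """ffmpeg command converting an MP3 to the 16 kHz mono PCM WAV
    format that whisper.cpp accepts unconditionally."""
    return [
        "ffmpeg", "-y",
        "-i", str(mp3),
        "-ar", "16000",       # 16 kHz sample rate expected by whisper.cpp
        "-ac", "1",           # mono
        "-c:a", "pcm_s16le",  # 16-bit PCM
        str(wav),
    ]
```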
### Step 3: Batch Transcription

- `transcribe.py` (Python, cross-platform) reads `manifest.json` and processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Writes .txt and .srt per file to `transcripts/{original_name_without_ext}/`
- Updates `manifest.json` with per-file transcription status
- **Resumability:** skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- **Quality gate:** after the first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider switching to unquantized `large-v3`, adjusting `--beam-size`, or accepting lower quality.
- **Validation:** print a summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
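
The per-file invocation and skip logic above can be sketched as a command builder plus a resumable wrapper. The binary name (`whisper-cli.exe`) is an assumption to confirm against the installed build; the flags mirror the bullet above:

```python
import subprocess
from pathlib import Path

WHISPER = Path("whisper-cli.exe")                # assumed binary name — confirm locally
MODEL = Path("models/ggml-large-v3-q5_0.bin")

def whisper_cmd(audio: Path, out_dir: Path) -> list[str]:
    """Build the whisper.cpp call for one lecture, using the Step 3 flags."""
    return [
        str(WHISPER),
        "--language", "ro",
        "--model", str(MODEL),
        "--output-txt", "--output-srt",
        "--output-file", str(out_dir / audio.stem),  # base name for .txt/.srt
        "--file", str(audio),
    ]

def transcribe(audio: Path, out_dir: Path) -> bool:
    """Run one file; skip if a non-empty .txt already exists (resumability)."""
    txt = out_dir / f"{audio.stem}.txt"
    if txt.exists() and txt.stat().st_size > 0:
        return True
    result = subprocess.run(whisper_cmd(audio, out_dir))
    # Empty output counts as failure, matching the validation bullet.
    return result.returncode == 0 and txt.exists() and txt.stat().st_size > 0
```

The returned boolean maps directly onto the `transcribe_status` field in `manifest.json`.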
### Step 4: Summary Generation with Claude Code

- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante" (English: "Summarize this transcription. Provide: (1) a 3-5 sentence overview, (2) key concepts with definitions, (3) important details and examples")
- **Chunking:** split transcripts > 10K words at sentence boundaries (not raw word count) with a 500-word overlap. Summarize the chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md` — a master study guide with lecture titles as headings
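
The chunking rule above (sentence boundaries, 500-word overlap) can be sketched like this; the regex split is a naive sentence detector, which is good enough for lecture transcripts:

```python
import re

def chunk_transcript(text: str, max_words: int = 10_000, overlap_words: int = 500) -> list[str]:
    """Split a transcript at sentence boundaries into chunks of at most
    `max_words`, repeating the last `overlap_words` words of each chunk
    at the start of the next so no context is lost across a cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            tail = " ".join(current).split()[-overlap_words:]  # carry overlap forward
            current, count = [" ".join(tail)], len(tail)
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Transcripts under the limit come back as a single chunk, so `summarize.py` can call this unconditionally.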
### Manifest Schema

```json
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
```
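
Since three scripts read and rewrite this file, the "manifest.json corruption" risk listed under Failure Modes is worth closing with an atomic write: serialize to a temp file, then rename over the original in one step. A minimal sketch:

```python
import json
import os
from pathlib import Path

def save_manifest(manifest: dict, path: Path) -> None:
    """Write manifest.json atomically: a crash mid-write leaves the old
    file intact, because os.replace swaps files in a single operation."""
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # atomic rename on the same filesystem

def load_manifest(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))
```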
### Directory Structure

```
nlp-master/
  .gitignore        # Excludes audio/, models/, .env
  .env              # Course credentials (not committed)
  manifest.json     # Shared metadata for all scripts
  download.py       # Step 1: site recon + download
  transcribe.py     # Step 3: batch transcription
  summarize.py      # Step 4: summary generation helper
  audio/
    Master 25M1 Z1A [Audio].mp3
    Master 25M1 Z1B [Audio].mp3
    ...
  models/
    ggml-large-v3-q5_0.bin
  transcripts/
    Master 25M1 Z1A.txt
    Master 25M1 Z1A.srt
    ...
  summaries/
    Master 25M1 Z1A_summary.md
    ...
  SUPORT_CURS.md
```
## Open Questions

1. ~~What is the exact website structure?~~ Resolved: browse the site first in Step 1.
2. ~~Are there lecture titles on the website?~~ Resolved: preserve original names + extract titles.
3. ~~Do you want the summaries in Romanian or English?~~ Resolved: Romanian.
4. Should the master study guide (SUPORT_CURS.md) include the full transcripts or just the summaries?
5. Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
6. Does the whisper.cpp Windows binary support MP3 input natively? (Validated in the Step 2 VRAM test)

## Success Criteria

- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives

## Next Steps

1. **git init + .gitignore** — set up project, exclude audio/models/.env (~2 min)
2. **Browse cursuri.aresens.ro** — understand site structure before coding (~10 min)
3. **Build download.py** — login + scrape + download + manifest.json (~15 min with CC)
4. **Install whisper.cpp on Windows** — pre-built binary or build from source + Vulkan SDK (~15 min)
5. **Download whisper model** — large-v3-q5_0 from Hugging Face (~5 min)
6. **Test transcription** — 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
7. **Build transcribe.py** — reads manifest, processes in module order, updates status (~10 min with CC)
8. **Run batch transcription** — ~12-18 hours (leave running during workday)
9. **Spot-check quality** — review 2-3 transcripts after Module 1 completes
10. **Generate summaries with Claude Code** — via summarize.py helper (~30 min)
11. **Compile SUPORT_CURS.md** — master study guide (~10 min)

## NOT in scope

- Building a web UI or search interface for transcripts — just flat files
- Automated quality scoring of transcriptions — manual spot-check is sufficient
- Speaker diarization (identifying different speakers) — single lecturer
- Translation to English — summaries stay in Romanian
- CI/CD or deployment — this is a local personal pipeline

## What already exists

- Essentially greenfield — no existing code to reuse
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs

## Failure Modes

```
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|==================
Session expires during download | No    | Yes (retry)    | No — logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No — validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No — logged
whisper.cpp can't read MP3      | No    | No**           | Yes — CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No — validation
Poor Romanian accuracy          | No    | Yes (gate)     | No — spot-check
Claude Code input too large     | No    | Yes (chunk)    | No — script handles
manifest.json corruption        | No    | No             | Yes — low risk

*  = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
```

**Critical gap:** whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.

## Eng Review Decisions (2026-03-24)

1. Hybrid platform → **All Windows Python** (not WSL2 for scripts)
2. Browse site first → build the right scraper, not two fallback paths
3. Preserve original file names + extract lecture titles
4. Add manifest.json as shared metadata between scripts
5. Python for all scripts (download.py, transcribe.py, summarize.py)
6. Built-in validation checks in each script
7. Feed MP3s directly (no pre-convert)
8. Process in module order
9. Realistic transcription estimate: 12-18 hours (not 7-8)

## What I noticed about how you think

- You said "vreau offline transcription + claude code pentru summaries" ("I want offline transcription + Claude Code for summaries") — you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 module din 6, fiecare cu 7 audio-uri" ("5 modules out of 6, each with 7 recordings") and "90-100 minute" (90-100 minutes) — you'd already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system" — you know what's useful without over-engineering.

## GSTACK REVIEW REPORT

| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | — | — |
| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — |

- **OUTSIDE VOICE:** Claude subagent ran — 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- **UNRESOLVED:** 0
- **VERDICT:** ENG CLEARED — ready to implement