# Design: NLP Master Course Audio Pipeline

Generated by /office-hours on 2026-03-23 | Branch: unknown | Repo: nlp-master (local, no git) | Status: APPROVED | Mode: Builder

## Problem Statement

Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.

## What Makes This Cool

58 hours of Romanian lecture audio turned into searchable, summarized study materials — completely automated. Download once, transcribe overnight, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.

## Constraints

- **Hardware:** AMD Radeon RX 6600M 8GB (RDNA2) — no CUDA, needs Vulkan or ROCm
- **Language:** Romanian audio — Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- **Source:** Password-protected website at cursuri.aresens.ro/curs/26
- **Scale:** ~35 MP3 files, ~95 min each, ~58 hours total
- **Privacy:** Course content is for personal study use only
- **Tooling:** Claude Code available for summary generation (no separate API cost)
- **Platform:** Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- **Summaries language:** Romanian (matching source material)
- **Audio format:** MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")

## Premises

1. Legitimate access to the course — downloading audio for personal study is within usage rights
2. whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
3. Audio quality is decent (recorded lectures) — Whisper large-v3 will produce usable Romanian transcripts
4. Summaries will be generated by Claude Code after transcription — separate step
5. Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing

## Approaches Considered

### Approach A: Full Pipeline (CHOSEN)

Python script for website login + MP3 download. Python script driving whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.

- Effort: M (human: ~2 days / CC: ~30 min to build, ~12-18 hours to run transcription)
- Risk: Low
- Pros: complete automation, reproducible for module 6, best quality
- Cons: the whisper.cpp Vulkan build requires system setup

### Approach B: Download + Transcribe Only

Same download + transcription, but no automated summaries. Simpler, but defers the valuable part.

- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low

### Approach C: Fully Offline (Local LLM Summaries)

Everything offline, including summaries via llama.cpp. Zero external costs but lower summary quality.

- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)

## Recommended Approach

**Approach A: Full Pipeline** — Download → whisper.cpp/Vulkan → Claude Code summaries.

**Execution model:** Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.

### Step 0: Project Setup

- Initialize a git repo with `.gitignore` (exclude: `audio/`, `models/`, `.env`, `*.mp3`, `*.wav`, `*.bin`)
- Install Python on Windows (if not already installed)
- Install the Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)

### Step 1: Site Recon + Download Audio Files

- **First:** browse cursuri.aresens.ro/curs/26 to understand the page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests + BeautifulSoup for static pages, Playwright for JS-rendered ones — don't build both)
- Log in with credentials from `.env` or an interactive prompt
- Discover all modules dynamically (don't hardcode 5x7 — the actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- **Resumability:** skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- **Validation:** after the download completes, print a summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
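
The resumability and retry bullets above can be sketched as follows. `fetch` stands in for whatever downloader the site recon dictates (requests or Playwright), and the 1 MB threshold mirrors the validation check; both names are illustrative, not final:

```python
import time
from pathlib import Path

def needs_download(path: Path, min_bytes: int = 1_000_000) -> bool:
    """Resumability check: skip files that exist and clear the > 1 MB validation bar."""
    return not (path.exists() and path.stat().st_size >= min_bytes)

def download_with_retry(fetch, dest: Path, retries: int = 3, backoff: float = 2.0) -> bool:
    """Call `fetch(dest)` up to `retries` times with exponential backoff.

    `fetch` is any callable that writes the file and raises on failure;
    the caller logs to download_errors.log when this returns False.
    """
    for attempt in range(retries):
        try:
            fetch(dest)
            return True
        except Exception:
            time.sleep(backoff * 2 ** attempt)  # 2s, 4s, 8s with the defaults
    return False
```

`download.py` would loop over the manifest, call `needs_download` first, and record the boolean result as the per-file download status.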
### Step 2: Install whisper.cpp with Vulkan (Windows native)

- Option A: download a pre-built Windows binary with Vulkan support from [whisper.cpp-windows-vulkan-bin](https://github.com/jerryshell/whisper.cpp-windows-vulkan-bin)
- Option B: build from source with Visual Studio and the `-DGGML_VULKAN=1` CMake flag
- Download the model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- **VRAM test:** transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and confirm MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- **Speed calibration:** the RX 6600M is roughly half the speed of an RX 9070 XT. Realistic estimate: **3-5x realtime** (~18-30 min per 90-min file). Total: **~12-18 hours** for all files. Plan for a full day, not overnight.
- **Fallback:** if large-v3-q5_0 runs out of VRAM on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
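
If the VRAM test shows the build cannot decode MP3, the pre-convert fallback is one ffmpeg call per file. A minimal sketch of the command builder (assumes ffmpeg is on PATH; pass the list to `subprocess.run`):

```python
from pathlib import Path

def ffmpeg_to_wav_cmd(mp3: Path, wav: Path) -> list[str]:
    """ffmpeg command converting an MP3 to the 16 kHz mono PCM WAV
    format that whisper.cpp accepts unconditionally."""
    return [
        "ffmpeg", "-y",
        "-i", str(mp3),
        "-ar", "16000",       # 16 kHz sample rate expected by whisper.cpp
        "-ac", "1",           # mono
        "-c:a", "pcm_s16le",  # 16-bit PCM
        str(wav),
    ]
```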
### Step 3: Batch Transcription

- `transcribe.py` (Python, cross-platform) reads `manifest.json` and processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Writes .txt and .srt per file to `transcripts/{original_name_without_ext}/`
- Updates `manifest.json` with per-file transcription status
- **Resumability:** skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- **Quality gate:** after the first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider switching to unquantized `large-v3`, adjusting `--beam-size`, or accepting lower quality.
- **Validation:** print a summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
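
The per-file invocation and skip logic above can be sketched as a command builder plus a resumable wrapper. The binary name (`whisper-cli.exe`) is an assumption to confirm against the installed build; the flags mirror the bullet above:

```python
import subprocess
from pathlib import Path

WHISPER = Path("whisper-cli.exe")                # assumed binary name — confirm locally
MODEL = Path("models/ggml-large-v3-q5_0.bin")

def whisper_cmd(audio: Path, out_dir: Path) -> list[str]:
    """Build the whisper.cpp call for one lecture, using the Step 3 flags."""
    return [
        str(WHISPER),
        "--language", "ro",
        "--model", str(MODEL),
        "--output-txt", "--output-srt",
        "--output-file", str(out_dir / audio.stem),  # base name for .txt/.srt
        "--file", str(audio),
    ]

def transcribe(audio: Path, out_dir: Path) -> bool:
    """Run one file; skip if a non-empty .txt already exists (resumability)."""
    txt = out_dir / f"{audio.stem}.txt"
    if txt.exists() and txt.stat().st_size > 0:
        return True
    result = subprocess.run(whisper_cmd(audio, out_dir))
    # Empty output counts as failure, matching the validation bullet.
    return result.returncode == 0 and txt.exists() and txt.stat().st_size > 0
```

The returned boolean maps directly onto the `transcribe_status` field in `manifest.json`.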
### Step 4: Summary Generation with Claude Code

- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante" (English: "Summarize this transcription. Provide: (1) a 3-5 sentence overview, (2) key concepts with definitions, (3) important details and examples")
- **Chunking:** split transcripts > 10K words at sentence boundaries (not raw word count) with a 500-word overlap. Summarize the chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md` — a master study guide with lecture titles as headings
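
The chunking rule above (sentence boundaries, 500-word overlap) can be sketched like this; the regex split is a naive sentence detector, which is good enough for lecture transcripts:

```python
import re

def chunk_transcript(text: str, max_words: int = 10_000, overlap_words: int = 500) -> list[str]:
    """Split a transcript at sentence boundaries into chunks of at most
    `max_words`, repeating the last `overlap_words` words of each chunk
    at the start of the next so no context is lost across a cut."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            tail = " ".join(current).split()[-overlap_words:]  # carry overlap forward
            current, count = [" ".join(tail)], len(tail)
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Transcripts under the limit come back as a single chunk, so `summarize.py` can call this unconditionally.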
### Manifest Schema

```json
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
```
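
Since three scripts read and rewrite this file, the "manifest.json corruption" risk listed under Failure Modes is worth closing with an atomic write: serialize to a temp file, then rename over the original in one step. A minimal sketch:

```python
import json
import os
from pathlib import Path

def save_manifest(manifest: dict, path: Path) -> None:
    """Write manifest.json atomically: a crash mid-write leaves the old
    file intact, because os.replace swaps files in a single operation."""
    tmp = path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # atomic rename on the same filesystem

def load_manifest(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))
```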
### Directory Structure

```
nlp-master/
  .gitignore        # Excludes audio/, models/, .env
  .env              # Course credentials (not committed)
  manifest.json     # Shared metadata for all scripts
  download.py       # Step 1: site recon + download
  transcribe.py     # Step 3: batch transcription
  summarize.py      # Step 4: summary generation helper
  audio/
    Master 25M1 Z1A [Audio].mp3
    Master 25M1 Z1B [Audio].mp3
    ...
  models/
    ggml-large-v3-q5_0.bin
  transcripts/
    Master 25M1 Z1A.txt
    Master 25M1 Z1A.srt
    ...
  summaries/
    Master 25M1 Z1A_summary.md
    ...
  SUPORT_CURS.md
```
## Open Questions

1. ~~What is the exact website structure?~~ Resolved: browse the site first in Step 1.
2. ~~Are there lecture titles on the website?~~ Resolved: preserve original names + extract titles.
3. ~~Do you want the summaries in Romanian or English?~~ Resolved: Romanian.
4. Should the master study guide (SUPORT_CURS.md) include the full transcripts or just the summaries?
5. Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
6. Does the whisper.cpp Windows binary support MP3 input natively? (Validated in the Step 2 VRAM test)

## Success Criteria

- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives

## Next Steps

1. **git init + .gitignore** — set up project, exclude audio/models/.env (~2 min)
2. **Browse cursuri.aresens.ro** — understand site structure before coding (~10 min)
3. **Build download.py** — login + scrape + download + manifest.json (~15 min with CC)
4. **Install whisper.cpp on Windows** — pre-built binary or build from source + Vulkan SDK (~15 min)
5. **Download whisper model** — large-v3-q5_0 from Hugging Face (~5 min)
6. **Test transcription** — 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
7. **Build transcribe.py** — reads manifest, processes in module order, updates status (~10 min with CC)
8. **Run batch transcription** — ~12-18 hours (leave running during workday)
9. **Spot-check quality** — review 2-3 transcripts after Module 1 completes
10. **Generate summaries with Claude Code** — via summarize.py helper (~30 min)
11. **Compile SUPORT_CURS.md** — master study guide (~10 min)

## NOT in scope

- Building a web UI or search interface for transcripts — just flat files
- Automated quality scoring of transcriptions — manual spot-check is sufficient
- Speaker diarization (identifying different speakers) — single lecturer
- Translation to English — summaries stay in Romanian
- CI/CD or deployment — this is a local personal pipeline

## What already exists

- Essentially greenfield — no existing code to reuse
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs

## Failure Modes

```
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|==================
Session expires during download | No    | Yes (retry)    | No — logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No — validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No — logged
whisper.cpp can't read MP3      | No    | No**           | Yes — CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No — validation
Poor Romanian accuracy          | No    | Yes (gate)     | No — spot-check
Claude Code input too large     | No    | Yes (chunk)    | No — script handles
manifest.json corruption        | No    | No             | Yes — low risk

*  = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
```

**Critical gap:** whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.

## Eng Review Decisions (2026-03-24)

1. Hybrid platform → **All Windows Python** (not WSL2 for scripts)
2. Browse site first → build the right scraper, not two fallback paths
3. Preserve original file names + extract lecture titles
4. Add manifest.json as shared metadata between scripts
5. Python for all scripts (download.py, transcribe.py, summarize.py)
6. Built-in validation checks in each script
7. Feed MP3s directly (no pre-convert)
8. Process in module order
9. Realistic transcription estimate: 12-18 hours (not 7-8)

## What I noticed about how you think

- You said "vreau offline transcription + claude code pentru summaries" ("I want offline transcription + Claude Code for summaries") — you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 module din 6, fiecare cu 7 audio-uri" ("5 modules out of 6, each with 7 recordings") and "90-100 minute" (90-100 minutes) — you'd already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system" — you know what's useful without over-engineering.

## GSTACK REVIEW REPORT

| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | — | — |
| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — |

- **OUTSIDE VOICE:** Claude subagent ran — 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- **UNRESOLVED:** 0
- **VERDICT:** ENG CLEARED — ready to implement