Design: NLP Master Course Audio Pipeline
Generated by /office-hours on 2026-03-23 Branch: unknown Repo: nlp-master (local, no git) Status: APPROVED Mode: Builder
Problem Statement
Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.
What Makes This Cool
58 hours of Romanian lecture audio turned into searchable, summarized study materials — completely automated. Download once, transcribe overnight, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.
Constraints
- Hardware: AMD Radeon RX 6600M 8GB (RDNA2) — no CUDA, needs Vulkan or ROCm
- Language: Romanian audio — Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- Source: Password-protected website at cursuri.aresens.ro/curs/26
- Scale: ~35 MP3 files, ~95 min each, ~58 hours total
- Privacy: Course content is for personal study use only
- Tooling: Claude Code available for summary generation (no separate API cost)
- Platform: Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- Summaries language: Romanian (matching source material)
- Audio format: MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")
Premises
- Legitimate access to the course — downloading audio for personal study is within usage rights
- whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
- Audio quality is decent (recorded lectures) — Whisper large-v3 will produce usable Romanian transcripts
- Summaries will be generated by Claude Code after transcription — separate step
- Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing
Approaches Considered
Approach A: Full Pipeline (CHOSEN)
Python script for website login + MP3 download. Shell script for whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.
- Effort: M (human: ~2 days / CC: ~30 min to build, ~8 hours to run transcription)
- Risk: Low
- Pros: Complete automation, reproducible for module 6, best quality
- Cons: whisper.cpp Vulkan build requires system setup
Approach B: Download + Transcribe Only
Same download + transcription, no automated summaries. Simpler but defers the valuable part.
- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low
Approach C: Fully Offline (Local LLM summaries)
Everything offline including summaries via llama.cpp. Zero external costs but lower summary quality.
- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)
Recommended Approach
Approach A: Full Pipeline — Download → whisper.cpp/Vulkan → Claude Code summaries.
Execution model: Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.
Step 0: Project Setup
- Initialize git repo with `.gitignore` (exclude: audio/, models/, .env, *.mp3, *.wav, *.bin)
- Install Python on Windows (if not already)
- Install Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)
Step 1: Site Recon + Download Audio Files
- First: Browse cursuri.aresens.ro/curs/26 to understand page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests + BeautifulSoup for static pages, Playwright for JS-rendered — don't build both)
- Login with credentials from `.env` or interactive prompt
- Discover all modules dynamically (don't hardcode 5x7 — actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- Resumability: skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- Validation: after download completes, print a summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
Step 2: Install whisper.cpp with Vulkan (Windows native)
- Option A: Download pre-built Windows binary with Vulkan from whisper.cpp-windows-vulkan-bin
- Option B: Build from source with Visual Studio + the `-DGGML_VULKAN=1` CMake flag
- Download model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- VRAM test: transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and validate MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- Speed calibration: RX 6600M is roughly half the speed of RX 9070 XT. Realistic estimate: 3-5x realtime (~18-30 min per 90-min file). Total: ~12-18 hours for all files. Plan for a full day, not overnight.
- Fallback: if large-v3-q5_0 OOMs on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
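The speed calibration above is simple arithmetic, worth writing down once so the 2-min test clip's measured realtime factor can be plugged straight in. A minimal sketch; `transcription_eta_hours` is a hypothetical helper, and the factors are the plan's own estimates:

```python
def transcription_eta_hours(num_files: int, minutes_per_file: float,
                            realtime_factor: float) -> float:
    """Estimate wall-clock hours for the whole batch.

    realtime_factor = minutes of audio processed per minute of wall time
    (e.g. 4.0 means 4x realtime, mid-range of the plan's 3-5x estimate).
    """
    total_audio_min = num_files * minutes_per_file
    return total_audio_min / realtime_factor / 60.0
```

At 3x realtime the 35-file batch lands near 18.5 hours and at 5x near 11, which is where the plan's 12-18 hour window comes from.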
Step 3: Batch Transcription
- `transcribe.py` (Python, cross-platform) reads `manifest.json`, processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Output .txt and .srt per file to a flat `transcripts/` directory (consistent with the manifest schema and directory structure)
- Updates `manifest.json` with transcription status per file
- Resumability: skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- Quality gate: after the first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider switching to unquantized `large-v3`, adjusting `--beam-size`, or accepting lower quality.
- Validation: print a summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
Step 4: Summary Generation with Claude Code
- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante" ("Summarize this transcript. Provide: (1) a general overview in 3-5 sentences, (2) key concepts with definitions, (3) important details and examples")
- Chunking: split transcripts longer than ~10K words at sentence boundaries (not at a raw word-count cutoff) with ~500-word overlap. Summarize chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md` — master study guide with lecture titles as headings
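The chunking rule above (sentence boundaries, ~500-word overlap) can be sketched as follows. This is a minimal illustration with a naive regex sentence splitter, adequate for lecture transcripts but not a real tokenizer; `chunk_transcript` is a hypothetical helper for `summarize.py`.

```python
import re

def chunk_transcript(text: str, max_words: int = 10_000,
                     overlap_words: int = 500) -> list[str]:
    """Split a transcript into chunks of at most max_words, cutting only
    at sentence boundaries, and repeat roughly overlap_words of trailing
    context at the start of the next chunk so no idea is cut mid-thought."""
    # Naive split on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        words = len(s.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Carry whole sentences from the tail until ~overlap_words.
            tail, tail_count = [], 0
            for prev in reversed(current):
                tail_count += len(prev.split())
                tail.insert(0, prev)
                if tail_count >= overlap_words:
                    break
            current, count = tail, tail_count
        current.append(s)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A 95-minute lecture at typical speaking pace runs roughly 12-15K words, so most transcripts will produce two chunks.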
Manifest Schema
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
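Since all three scripts share this schema, a common helper to flatten the nested structure and filter by stage keeps the resumability logic in one place. A sketch assuming the field names above; `pending_lectures` is a hypothetical name:

```python
def pending_lectures(manifest: dict, stage: str) -> list[dict]:
    """Flatten modules -> lectures and return entries whose status for the
    given stage ('download' or 'transcribe') is not 'complete'."""
    key = f"{stage}_status"
    return [
        lec
        for module in manifest["modules"]
        for lec in module["lectures"]
        if lec.get(key) != "complete"
    ]
```

`download.py` and `transcribe.py` would each call this at startup, which is what makes the whole pipeline safely re-runnable when module 6 arrives.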
Directory Structure
nlp-master/
.gitignore # Excludes audio/, models/, .env
.env # Course credentials (not committed)
manifest.json # Shared metadata for all scripts
download.py # Step 1: site recon + download
transcribe.py # Step 3: batch transcription
summarize.py # Step 4: summary generation helper
audio/
Master 25M1 Z1A [Audio].mp3
Master 25M1 Z1B [Audio].mp3
...
models/
ggml-large-v3-q5_0.bin
transcripts/
Master 25M1 Z1A.txt
Master 25M1 Z1A.srt
...
summaries/
Master 25M1 Z1A_summary.md
...
SUPORT_CURS.md
Open Questions
- What is the exact website structure? Resolved: browse site first in Step 1.
- Are there lecture titles on the website? Resolved: preserve original names + extract titles.
- Do you want the summaries in Romanian or English? Resolved: Romanian.
- Should the master study guide (SUPORT_CURS.md) include the full transcripts or just summaries?
- Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
- Does whisper.cpp Windows binary support MP3 input natively? (Validated in Step 2 VRAM test)
Success Criteria
- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives
Next Steps
- git init + .gitignore — set up project, exclude audio/models/.env (~2 min)
- Browse cursuri.aresens.ro — understand site structure before coding (~10 min)
- Build download.py — login + scrape + download + manifest.json (~15 min with CC)
- Install whisper.cpp on Windows — pre-built binary or build from source + Vulkan SDK (~15 min)
- Download whisper model — large-v3-q5_0 from Hugging Face (~5 min)
- Test transcription — 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
- Build transcribe.py — reads manifest, processes in module order, updates status (~10 min with CC)
- Run batch transcription — ~12-18 hours (leave running during workday)
- Spot-check quality — review 2-3 transcripts after Module 1 completes
- Generate summaries with Claude Code — via summarize.py helper (~30 min)
- Compile SUPORT_CURS.md — master study guide (~10 min)
NOT in scope
- Building a web UI or search interface for transcripts — just flat files
- Automated quality scoring of transcriptions — manual spot-check is sufficient
- Speaker diarization (identifying different speakers) — single lecturer
- Translation to English — summaries stay in Romanian
- CI/CD or deployment — this is a local personal pipeline
What already exists
- Nothing — greenfield project. No existing code to reuse.
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs.
Failure Modes
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|====================
Session expires during download | No    | Yes (retry)    | No — logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No — validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No — logged
whisper.cpp can't read MP3      | No    | No**           | Yes — CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No — validation
Poor Romanian accuracy          | No    | Yes (gate)     | No — spot-check
Claude Code input too large     | No    | Yes (chunk)    | No — script handles
manifest.json corruption        | No    | No             | Yes — low risk
* = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
Critical gap: whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.
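The two silent failure modes above (unreadable MP3 producing garbage, empty transcript) can both be caught by one cheap check run right after each transcription. A sketch; `transcript_looks_sane` and its thresholds are hypothetical, tuned by eye rather than measured:

```python
def transcript_looks_sane(text: str, min_chars: int = 200,
                          min_alpha_ratio: float = 0.6) -> bool:
    """Cheap sanity check on a finished transcript: long enough, and
    mostly letters/whitespace. A decoder fed unreadable audio tends to
    emit nothing, punctuation runs, or repeated symbols, all of which
    fail the alphabetic-ratio test."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return alpha / len(stripped) >= min_alpha_ratio
```

Running this inside `transcribe.py` after each file turns both "Yes — CRITICAL" rows into logged, non-silent failures, and makes the Step 2 MP3 validation a hard gate rather than a visual check.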
Eng Review Decisions (2026-03-24)
- Hybrid platform → All Windows Python (not WSL2 for scripts)
- Browse site first → build the right scraper, not two fallback paths
- Preserve original file names + extract lecture titles
- Add manifest.json as shared metadata between scripts
- Python for all scripts (download.py, transcribe.py, summarize.py)
- Built-in validation checks in each script
- Feed MP3s directly (no pre-convert)
- Process in module order
- Realistic transcription estimate: 12-18 hours (not 7-8)
What I noticed about how you think
- You said "vreau offline transcription + claude code pentru summaries" ("I want offline transcription + Claude Code for summaries") — you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 module din 6, fiecare cu 7 audio-uri" ("5 of 6 modules, each with 7 audio files") and "90-100 minute" (90-100 minutes) — you'd already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system" — you know what's useful without over-engineering.
GSTACK REVIEW REPORT
| Review | Trigger | Why | Runs | Status | Findings |
|---|---|---|---|---|---|
| CEO Review | /plan-ceo-review | Scope & strategy | 0 | — | — |
| Codex Review | /codex review | Independent 2nd opinion | 0 | — | — |
| Eng Review | /plan-eng-review | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | /plan-design-review | UI/UX gaps | 0 | — | — |
- OUTSIDE VOICE: Claude subagent ran — 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- UNRESOLVED: 0
- VERDICT: ENG CLEARED — ready to implement