Compare commits

5 Commits

Author SHA1 Message Date
45e72bc89b feat: add --modules filter to download.py as well
The parameter from run.bat (e.g. 4-5) was passed only to transcribe.py.
download.py now receives the same filter and downloads only the
specified modules. Accepted syntax: '4-5', '1,3', '1-3,5'.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:10:33 +02:00
7b18e8fc41 fix(run.bat): avoid the Microsoft Store Python stub that terminates cmd.exe
The Store stub for python.exe terminates the batch process when it is
called directly. Fix: use the 'py' launcher (safe) or detect the real
python.exe via 'where | findstr /v WindowsApps', without executing python
as part of the check. All python calls -> !PYTHON_CMD!.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:06:42 +02:00
e83bd74813 feat: switch to CPU-only whisper build (no GPU on this machine)
- setup_whisper.py: downloads the CPU build from the official releases,
  skips Vulkan/CUDA/OpenBLAS
- run.bat: removes the GGML_VK_PREFER_HOST_MEMORY env var and the Vulkan SDK check
- transcribe.py: --no-gpu was already set

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 02:01:39 +02:00
60f564c107 fix(run.bat): restore CRLF line endings, add .gitattributes
Windows batch files require CRLF — LF-only caused cmd.exe to exit
silently mid-script. .gitattributes ensures *.bat stays CRLF.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 01:55:03 +02:00
696c04c41c chore: normalize line endings from CRLF to LF across all files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-24 01:53:35 +02:00
11 changed files with 1924 additions and 1916 deletions
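
The `--modules` filter described in commit 45e72bc89b could be applied to the discovered module list roughly as follows. This is a minimal sketch: `parse_module_filter` mirrors the function added to download.py, while `select_modules` and the sample module dicts are hypothetical and exist only for illustration. How download.py itself applies the filter is not visible in this part of the diff.

```python
from typing import Optional


def parse_module_filter(arg: str) -> set[int]:
    """Parse '4-5', '1,3' or '1-3,5' into a set of 1-based module indices."""
    result: set[int] = set()
    for part in arg.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            result.update(range(int(start), int(end) + 1))
        else:
            result.add(int(part))
    return result


def select_modules(modules: list[dict], module_filter: Optional[str]) -> list[dict]:
    """Hypothetical helper: keep only modules whose 1-based position is selected."""
    if not module_filter:
        return modules  # no filter given: keep everything
    wanted = parse_module_filter(module_filter)
    return [m for i, m in enumerate(modules, start=1) if i in wanted]


if __name__ == "__main__":
    sample = [{"name": f"Modul {n}"} for n in range(1, 6)]  # illustrative data
    print([m["name"] for m in select_modules(sample, "1-2,5")])
```

Selecting by position assumes the modules are discovered in course order; if the site ever lists them differently, filtering on a parsed module number would be the safer choice.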

.gitattributes (new file)

@@ -0,0 +1,7 @@
# Default: LF for all text files
* text=auto eol=lf
# Windows-only files must stay CRLF
*.bat text eol=crlf
*.cmd text eol=crlf
*.ps1 text eol=crlf
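
To check that a batch file really kept its CRLF endings after checkout, a small script can look for bare LF bytes. A minimal sketch; `has_bare_lf` is a hypothetical helper and not part of the repository:

```python
import re


def has_bare_lf(data: bytes) -> bool:
    """Return True if any LF byte is not preceded by CR (an LF-only line ending)."""
    return re.search(rb"(?<!\r)\n", data) is not None


if __name__ == "__main__":
    print(has_bare_lf(b"@echo off\r\necho hi\r\n"))  # CRLF-only input
    print(has_bare_lf(b"@echo off\necho hi\n"))      # LF-only input
```

Reading the file in binary mode (for example `Path("run.bat").read_bytes()`) matters here, because text mode would normalize the line endings before the check sees them.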

.gitignore

@@ -1,34 +1,34 @@
# Audio files
audio/
*.mp3
*.wav
# Whisper models
models/
*.bin
# Credentials
.env
# Transcripts and summaries (large generated content)
transcripts/
summaries/
# Binaries (downloaded by setup_whisper.py)
whisper-bin/
ffmpeg-bin/
# Temp files
.whisper_bin_path
.ffmpeg_bin_path
# WAV cache (converted from MP3)
audio_wav/
# Python
__pycache__/
*.pyc
.venv/
# Logs
*.log

PLAN.md

@@ -1,243 +1,243 @@
# Design: NLP Master Course Audio Pipeline
Generated by /office-hours on 2026-03-23
Branch: unknown
Repo: nlp-master (local, no git)
Status: APPROVED
Mode: Builder

## Problem Statement
Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.

## What Makes This Cool
58 hours of Romanian lecture audio turned into searchable, summarized study materials, completely automated. Download once, transcribe overnight, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.

## Constraints
- **Hardware:** AMD Radeon RX 6600M 8GB (RDNA2): no CUDA, needs Vulkan or ROCm
- **Language:** Romanian audio; Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- **Source:** Password-protected website at cursuri.aresens.ro/curs/26
- **Scale:** ~35 MP3 files, ~95 min each, ~58 hours total
- **Privacy:** Course content is for personal study use only
- **Tooling:** Claude Code available for summary generation (no separate API cost)
- **Platform:** Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- **Summaries language:** Romanian (matching source material)
- **Audio format:** MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")

## Premises
1. Legitimate access to the course; downloading audio for personal study is within usage rights
2. whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
3. Audio quality is decent (recorded lectures); Whisper large-v3 will produce usable Romanian transcripts
4. Summaries will be generated by Claude Code after transcription, as a separate step
5. Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing

## Approaches Considered
### Approach A: Full Pipeline (CHOSEN)
Python script for website login + MP3 download. Shell script for whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.
- Effort: M (human: ~2 days / CC: ~30 min to build, ~8 hours to run transcription)
- Risk: Low
- Pros: Complete automation, reproducible for module 6, best quality
- Cons: whisper.cpp Vulkan build requires system setup

### Approach B: Download + Transcribe Only
Same download + transcription, no automated summaries. Simpler but defers the valuable part.
- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low

### Approach C: Fully Offline (Local LLM summaries)
Everything offline including summaries via llama.cpp. Zero external costs but lower summary quality.
- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)

## Recommended Approach
**Approach A: Full Pipeline**: Download → whisper.cpp/Vulkan → Claude Code summaries.
**Execution model:** Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.

### Step 0: Project Setup
- Initialize git repo with `.gitignore` (exclude: `audio/`, `models/`, `.env`, `*.mp3`, `*.wav`, `*.bin`)
- Install Python on Windows (if not already)
- Install Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)

### Step 1: Site Recon + Download Audio Files
- **First:** Browse cursuri.aresens.ro/curs/26 to understand page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests+BS4 for static, playwright for JS-rendered; don't build both)
- Login with credentials from `.env` or interactive prompt
- Discover all modules dynamically (don't hardcode 5x7; the actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- **Resumability:** skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- **Validation:** after download completes, print summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
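
The resumability and retry rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual `download.py`: `download_with_retry`, its `fetch` callback, and the `expected_size` parameter are hypothetical, and the backoff values are borrowed from the `RETRY_BACKOFF` constant in download.py.

```python
import time
from pathlib import Path
from typing import Callable, Optional

MAX_RETRIES = 3
RETRY_BACKOFF = [5, 15, 30]  # seconds to wait after each failed attempt


def download_with_retry(
    dest: Path,
    fetch: Callable[[Path], None],
    expected_size: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Skip files that already look complete; otherwise retry `fetch` with backoff.

    `fetch` is a hypothetical callback that writes the file to `dest` and raises
    OSError on failure (requests.RequestException is a subclass of OSError).
    Returns "skipped", "complete" or "failed".
    """
    if dest.exists() and dest.stat().st_size > 0:
        if expected_size is None or dest.stat().st_size == expected_size:
            return "skipped"
    for attempt in range(MAX_RETRIES):
        try:
            fetch(dest)
            return "complete"
        except OSError:
            dest.unlink(missing_ok=True)  # drop any partial file
            if attempt < MAX_RETRIES - 1:
                sleep(RETRY_BACKOFF[attempt])
    return "failed"
```

Passing `sleep` as a parameter is only a convenience for testing the backoff sequence without actually waiting.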
### Step 2: Install whisper.cpp with Vulkan (Windows native)
- Option A: Download pre-built Windows binary with Vulkan from [whisper.cpp-windows-vulkan-bin](https://github.com/jerryshell/whisper.cpp-windows-vulkan-bin)
- Option B: Build from source with Visual Studio + `-DGGML_VULKAN=1` CMake flag
- Download model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- **VRAM test:** transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and validate MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- **Speed calibration:** RX 6600M is roughly half the speed of RX 9070 XT. Realistic estimate: **3-5x realtime** (~18-30 min per 90-min file). Total: **~12-18 hours** for all files. Plan for a full day, not overnight.
- **Fallback:** if large-v3-q5_0 OOMs on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
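
The speed estimate above follows from simple arithmetic, sketched here. The file count, lecture length, and the 3-5x realtime range come from the plan; `transcription_hours` is just an illustrative helper.

```python
def transcription_hours(files: int, minutes_per_file: float, realtime_factor: float) -> float:
    """Wall-clock hours needed to transcribe `files` recordings at a given speed-up."""
    return files * minutes_per_file / realtime_factor / 60


if __name__ == "__main__":
    for factor in (3, 5):
        hours = transcription_hours(35, 95, factor)
        print(f"{factor}x realtime: {hours:.1f} h")
```

With 35 files of 95 minutes each this gives roughly 11 hours at 5x and 18.5 hours at 3x, which is consistent with the ~12-18 hour range quoted in the plan.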
### Step 3: Batch Transcription
- `transcribe.py` (Python, cross-platform) reads `manifest.json`, processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Output .txt and .srt per file to `transcripts/{original_name_without_ext}/`
- Updates `manifest.json` with transcription status per file
- **Resumability:** skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- **Quality gate:** after first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider: switching to `large-v3` unquantized, adjusting `--beam-size`, or accepting lower quality.
- **Validation:** print summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
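
The transcription call in this step might be assembled as shown below. A sketch only: the flags are the ones listed above plus `--file` and `--output-file`, which the whisper.cpp CLI accepts; the helper names and any binary path passed in are assumptions, and the non-empty check on the .txt file is slightly stricter than the plan's "existing .txt" rule.

```python
from pathlib import Path


def build_whisper_command(whisper_bin: Path, model: Path, audio: Path, out_dir: Path) -> list[str]:
    """Build the argument list for one whisper.cpp run (nothing is executed here)."""
    out_base = out_dir / audio.stem  # whisper.cpp appends .txt / .srt itself
    return [
        str(whisper_bin),
        "--model", str(model),
        "--language", "ro",
        "--output-txt",
        "--output-srt",
        "--output-file", str(out_base),
        "--file", str(audio),
    ]


def needs_transcription(audio: Path, out_dir: Path) -> bool:
    """Resumability check: skip files that already have a non-empty .txt output."""
    txt = out_dir / f"{audio.stem}.txt"
    return not (txt.exists() and txt.stat().st_size > 0)
```

The list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces and brackets, such as "Master 25M1 Z1A [Audio].mp3".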
### Step 4: Summary Generation with Claude Code
- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante"
- **Chunking:** split transcripts > 10K words at sentence boundaries (not raw word count) with 500-word overlap. Summarize chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md`, a master study guide with lecture titles as headings
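
The chunking rule above can be sketched like this. The 10K-word limit and 500-word overlap come from the plan; the regex-based sentence splitter and the function name `chunk_transcript` are simplifying assumptions, and `summarize.py` may do this differently.

```python
import re


def chunk_transcript(text: str, max_words: int = 10_000, overlap_words: int = 500) -> list[str]:
    """Split text into chunks of roughly max_words, cutting only between sentences.

    Each new chunk starts with trailing sentences of the previous one
    (about overlap_words words) so context carries across the cut.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            # Carry whole sentences over from the end until the overlap is reached.
            carried: list[str] = []
            carried_words = 0
            for prev in reversed(current):
                if carried_words >= overlap_words:
                    break
                carried.insert(0, prev)
                carried_words += len(prev.split())
            current, count = carried, carried_words
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is built from whole sentences, it can be slightly longer than `overlap_words`; that trade-off keeps every chunk boundary on a sentence end.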
### Manifest Schema
```json
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
```
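
A script consuming this manifest could pick out the lectures that still need transcription as sketched below. Field names follow the schema above; `pending_lectures` is a hypothetical helper, and treating `"complete"` as the finished value of `transcribe_status` is an assumption (the schema only shows `"pending"`).

```python
import json
from typing import Iterator


def pending_lectures(manifest: dict) -> Iterator[tuple[str, dict]]:
    """Yield (module name, lecture) pairs that are downloaded but not yet transcribed."""
    for module in manifest.get("modules", []):
        for lecture in module.get("lectures", []):
            if (lecture.get("download_status") == "complete"
                    and lecture.get("transcribe_status") != "complete"):
                yield module["name"], lecture


if __name__ == "__main__":
    sample = json.loads("""
    {"modules": [{"name": "Modul 1", "lectures": [
        {"title": "Master 25M1 Z1A",
         "download_status": "complete",
         "transcribe_status": "pending"}
    ]}]}
    """)
    for module_name, lecture in pending_lectures(sample):
        print(module_name, lecture["title"])
```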
### Directory Structure
```
nlp-master/
  .gitignore          # Excludes audio/, models/, .env
  .env                # Course credentials (not committed)
  manifest.json       # Shared metadata for all scripts
  download.py         # Step 1: site recon + download
  transcribe.py       # Step 3: batch transcription
  summarize.py        # Step 4: summary generation helper
  audio/
    Master 25M1 Z1A [Audio].mp3
    Master 25M1 Z1B [Audio].mp3
    ...
  models/
    ggml-large-v3-q5_0.bin
  transcripts/
    Master 25M1 Z1A.txt
    Master 25M1 Z1A.srt
    ...
  summaries/
    Master 25M1 Z1A_summary.md
    ...
  SUPORT_CURS.md
```
## Open Questions
1. ~~What is the exact website structure?~~ Resolved: browse site first in Step 1.
2. ~~Are there lecture titles on the website?~~ Resolved: preserve original names + extract titles.
3. ~~Do you want the summaries in Romanian or English?~~ Resolved: Romanian.
4. Should the master study guide (SUPORT_CURS.md) include the full transcripts or just summaries?
5. Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
6. Does whisper.cpp Windows binary support MP3 input natively? (Validated in Step 2 VRAM test)

## Success Criteria
- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives

## Next Steps
1. **git init + .gitignore**: set up project, exclude audio/models/.env (~2 min)
2. **Browse cursuri.aresens.ro**: understand site structure before coding (~10 min)
3. **Build download.py**: login + scrape + download + manifest.json (~15 min with CC)
4. **Install whisper.cpp on Windows**: pre-built binary or build from source + Vulkan SDK (~15 min)
5. **Download whisper model**: large-v3-q5_0 from Hugging Face (~5 min)
6. **Test transcription**: 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
7. **Build transcribe.py**: reads manifest, processes in module order, updates status (~10 min with CC)
8. **Run batch transcription**: ~12-18 hours (leave running during workday)
9. **Spot-check quality**: review 2-3 transcripts after Module 1 completes
10. **Generate summaries with Claude Code**: via summarize.py helper (~30 min)
11. **Compile SUPORT_CURS.md**: master study guide (~10 min)

## NOT in scope
- Building a web UI or search interface for transcripts: just flat files
- Automated quality scoring of transcriptions: manual spot-check is sufficient
- Speaker diarization (identifying different speakers): single lecturer
- Translation to English: summaries stay in Romanian
- CI/CD or deployment: this is a local personal pipeline

## What already exists
- Nothing: greenfield project. No existing code to reuse.
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs.

## Failure Modes
```
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|=====================
Session expires during download | No    | Yes (retry)    | No - logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No - validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No - logged
whisper.cpp can't read MP3      | No    | No**           | Yes - CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No - validation
Poor Romanian accuracy          | No    | Yes (gate)     | No - spot-check
Claude Code input too large     | No    | Yes (chunk)    | No - script handles
manifest.json corruption        | No    | No             | Yes - low risk
* = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
```
**Critical gap:** whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.
## Eng Review Decisions (2026-03-24)
1. Hybrid platform → **All Windows Python** (not WSL2 for scripts)
2. Browse site first → build the right scraper, not two fallback paths
3. Preserve original file names + extract lecture titles
4. Add manifest.json as shared metadata between scripts
5. Python for all scripts (download.py, transcribe.py, summarize.py)
6. Built-in validation checks in each script
7. Feed MP3s directly (no pre-convert)
8. Process in module order
9. Realistic transcription estimate: 12-18 hours (not 7-8)

## What I noticed about how you think
- You said "I want offline transcription + Claude Code for summaries": you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 modules out of 6, each with 7 audio files" and "90-100 minutes". You had already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system"; you know what's useful without over-engineering.

## GSTACK REVIEW REPORT
| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | - | - |
| Codex Review | `/codex review` | Independent 2nd opinion | 0 | - | - |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | `/plan-design-review` | UI/UX gaps | 0 | - | - |
- **OUTSIDE VOICE:** Claude subagent ran: 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- **UNRESOLVED:** 0
- **VERDICT:** ENG CLEARED, ready to implement


@@ -1,8 +1,8 @@
# TODOS
## Re-run pipeline for Module 6
- **What:** Re-run `download.py` when module 6 becomes available on cursuri.aresens.ro/curs/26
- **Why:** Course has 6 modules total, only 5 are currently available. Pipeline is designed to be re-runnable: manifest.json + resumability means it discovers new modules and skips already-downloaded files.
- **How:** Run `python download.py` → check manifest for new files → run `python transcribe.py` → generate summaries → update SUPORT_CURS.md
- **Depends on:** Course provider publishing module 6
- **Added:** 2026-03-24


@@ -1,253 +1,277 @@
"""
Download all audio files from cursuri.aresens.ro NLP Master course.
Logs in, discovers modules and lectures, downloads MP3s, writes manifest.json.
Resumable: skips already-downloaded files.
"""
import json
import logging
import os
import sys
import time
from pathlib import Path
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

BASE_URL = "https://cursuri.aresens.ro"
COURSE_URL = f"{BASE_URL}/curs/26"
LOGIN_URL = f"{BASE_URL}/login"
AUDIO_DIR = Path("audio")
MANIFEST_PATH = Path("manifest.json")
MAX_RETRIES = 3
RETRY_BACKOFF = [5, 15, 30]

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("download_errors.log"),
    ],
)
log = logging.getLogger(__name__)


def login(session: requests.Session, email: str, password: str) -> bool:
    """Login and return True on success."""
    resp = session.post(LOGIN_URL, data={
        "email": email,
        "password": password,
        "act": "login",
        "remember": "on",
    }, allow_redirects=True)
    # Successful login redirects to the course page, not back to /login
    if "/login" in resp.url or "loginform" in resp.text:
        return False
    return True
def discover_modules(session: requests.Session) -> list[dict]: def parse_module_filter(arg: str) -> set[int]:
"""Fetch course page and return list of {name, url, module_id}.""" """Parse module filter like '1-3' or '4,5' or '1-3,5' into a set of 1-based indices."""
resp = session.get(COURSE_URL) result = set()
resp.raise_for_status() for part in arg.split(","):
soup = BeautifulSoup(resp.text, "html.parser") part = part.strip()
if "-" in part:
modules = [] a, b = part.split("-", 1)
for div in soup.select("div.module"): result.update(range(int(a), int(b) + 1))
number_el = div.select_one("div.module__number") else:
link_el = div.select_one("a.btn") result.add(int(part))
if not number_el or not link_el: return result
continue
href = link_el.get("href", "")
module_id = href.rstrip("/").split("/")[-1] def discover_modules(session: requests.Session) -> list[dict]:
modules.append({ """Fetch course page and return list of {name, url, module_id}."""
"name": number_el.get_text(strip=True), resp = session.get(COURSE_URL)
"url": urljoin(BASE_URL, href), resp.raise_for_status()
"module_id": module_id, soup = BeautifulSoup(resp.text, "html.parser")
})
log.info(f"Found {len(modules)} modules") modules = []
return modules for div in soup.select("div.module"):
number_el = div.select_one("div.module__number")
link_el = div.select_one("a.btn")
def discover_lectures(session: requests.Session, module: dict) -> list[dict]: if not number_el or not link_el:
"""Fetch a module page and return list of lectures with audio URLs.""" continue
resp = session.get(module["url"]) href = link_el.get("href", "")
resp.raise_for_status() module_id = href.rstrip("/").split("/")[-1]
soup = BeautifulSoup(resp.text, "html.parser") modules.append({
"name": number_el.get_text(strip=True),
lectures = [] "url": urljoin(BASE_URL, href),
for lesson_div in soup.select("div.lesson"): "module_id": module_id,
name_el = lesson_div.select_one("div.module__name") })
source_el = lesson_div.select_one("audio source") log.info(f"Found {len(modules)} modules")
if not name_el or not source_el: return modules
continue
src = source_el.get("src", "").strip()
if not src: def discover_lectures(session: requests.Session, module: dict) -> list[dict]:
continue """Fetch a module page and return list of lectures with audio URLs."""
audio_url = urljoin(BASE_URL, src) resp = session.get(module["url"])
filename = src.split("/")[-1] resp.raise_for_status()
title = name_el.get_text(strip=True) soup = BeautifulSoup(resp.text, "html.parser")
lectures.append({
"title": title, lectures = []
"original_filename": filename, for lesson_div in soup.select("div.lesson"):
"url": audio_url, name_el = lesson_div.select_one("div.module__name")
"audio_path": str(AUDIO_DIR / filename), source_el = lesson_div.select_one("audio source")
}) if not name_el or not source_el:
log.info(f" {module['name']}: {len(lectures)} lectures") continue
return lectures src = source_el.get("src", "").strip()
if not src:
continue
def download_file(session: requests.Session, url: str, dest: Path) -> bool: audio_url = urljoin(BASE_URL, src)
"""Download a file with retry logic. Returns True on success.""" filename = src.split("/")[-1]
for attempt in range(MAX_RETRIES): title = name_el.get_text(strip=True)
try: lectures.append({
resp = session.get(url, stream=True, timeout=300) "title": title,
resp.raise_for_status() "original_filename": filename,
"url": audio_url,
# Write to temp file first, then rename (atomic) "audio_path": str(AUDIO_DIR / filename),
tmp = dest.with_suffix(".tmp") })
total = 0 log.info(f" {module['name']}: {len(lectures)} lectures")
with open(tmp, "wb") as f: return lectures
for chunk in resp.iter_content(chunk_size=1024 * 1024):
f.write(chunk)
total += len(chunk) def download_file(session: requests.Session, url: str, dest: Path) -> bool:
"""Download a file with retry logic. Returns True on success."""
if total < 1_000_000: # < 1MB is suspicious for attempt in range(MAX_RETRIES):
log.warning(f"File too small ({total} bytes): {dest.name}") try:
tmp.unlink(missing_ok=True) resp = session.get(url, stream=True, timeout=300)
return False resp.raise_for_status()
tmp.rename(dest) # Write to temp file first, then rename (atomic)
log.info(f" Downloaded: {dest.name} ({total / 1_000_000:.1f} MB)") tmp = dest.with_suffix(".tmp")
return True total = 0
with open(tmp, "wb") as f:
except Exception as e: for chunk in resp.iter_content(chunk_size=1024 * 1024):
wait = RETRY_BACKOFF[attempt] if attempt < len(RETRY_BACKOFF) else 30 f.write(chunk)
log.warning(f" Attempt {attempt + 1}/{MAX_RETRIES} failed for {dest.name}: {e}") total += len(chunk)
if attempt < MAX_RETRIES - 1:
log.info(f" Retrying in {wait}s...") if total < 1_000_000: # < 1MB is suspicious
time.sleep(wait) log.warning(f"File too small ({total} bytes): {dest.name}")
tmp.unlink(missing_ok=True)
log.error(f" FAILED after {MAX_RETRIES} attempts: {dest.name}") return False
return False
tmp.rename(dest)
log.info(f" Downloaded: {dest.name} ({total / 1_000_000:.1f} MB)")
def load_manifest() -> dict | None: return True
"""Load existing manifest if present."""
if MANIFEST_PATH.exists(): except Exception as e:
with open(MANIFEST_PATH) as f: wait = RETRY_BACKOFF[attempt] if attempt < len(RETRY_BACKOFF) else 30
return json.load(f) log.warning(f" Attempt {attempt + 1}/{MAX_RETRIES} failed for {dest.name}: {e}")
return None if attempt < MAX_RETRIES - 1:
log.info(f" Retrying in {wait}s...")
time.sleep(wait)
def save_manifest(manifest: dict):
"""Write manifest.json.""" log.error(f" FAILED after {MAX_RETRIES} attempts: {dest.name}")
with open(MANIFEST_PATH, "w", encoding="utf-8") as f: return False
json.dump(manifest, f, indent=2, ensure_ascii=False)
def load_manifest() -> dict | None:
def main(): """Load existing manifest if present."""
load_dotenv() if MANIFEST_PATH.exists():
email = os.getenv("COURSE_USERNAME", "") with open(MANIFEST_PATH) as f:
password = os.getenv("COURSE_PASSWORD", "") return json.load(f)
if not email or not password: return None
log.error("Set COURSE_USERNAME and COURSE_PASSWORD in .env")
sys.exit(1)
def save_manifest(manifest: dict):
AUDIO_DIR.mkdir(exist_ok=True) """Write manifest.json."""
with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
session = requests.Session() json.dump(manifest, f, indent=2, ensure_ascii=False)
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
log.info("Logging in...") def main():
if not login(session, email, password): load_dotenv()
log.error("Login failed. Check credentials in .env") email = os.getenv("COURSE_USERNAME", "")
sys.exit(1) password = os.getenv("COURSE_PASSWORD", "")
log.info("Login successful") if not email or not password:
log.error("Set COURSE_USERNAME and COURSE_PASSWORD in .env")
# Discover structure sys.exit(1)
modules = discover_modules(session)
if not modules: # Parse --modules filter (e.g. "4-5" or "1,3,5")
log.error("No modules found") module_filter = None
sys.exit(1) if "--modules" in sys.argv:
idx = sys.argv.index("--modules")
manifest = { if idx + 1 < len(sys.argv):
"course": "NLP Master Practitioner Bucuresti 2025", module_filter = parse_module_filter(sys.argv[idx + 1])
"source_url": COURSE_URL, log.info(f"Module filter: {sorted(module_filter)}")
"modules": [],
} AUDIO_DIR.mkdir(exist_ok=True)
total_files = 0 session = requests.Session()
downloaded = 0 session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
skipped = 0
failed = 0 log.info("Logging in...")
if not login(session, email, password):
for mod in modules: log.error("Login failed. Check credentials in .env")
lectures = discover_lectures(session, mod) sys.exit(1)
module_entry = { log.info("Login successful")
"name": mod["name"],
"module_id": mod["module_id"], # Discover structure
"lectures": [], modules = discover_modules(session)
} if not modules:
log.error("No modules found")
for lec in lectures: sys.exit(1)
total_files += 1
dest = Path(lec["audio_path"]) manifest = {
stem = dest.stem.replace(" [Audio]", "") "course": "NLP Master Practitioner Bucuresti 2025",
"source_url": COURSE_URL,
lecture_entry = { "modules": [],
"title": lec["title"], }
"original_filename": lec["original_filename"],
"url": lec["url"], total_files = 0
"audio_path": lec["audio_path"], downloaded = 0
"transcript_path": f"transcripts/{stem}.txt", skipped = 0
"srt_path": f"transcripts/{stem}.srt", failed = 0
"summary_path": f"summaries/{stem}_summary.md",
"download_status": "pending", for mod_idx, mod in enumerate(modules, 1):
"transcribe_status": "pending", if module_filter and mod_idx not in module_filter:
"file_size_bytes": 0, log.info(f" Skipping module {mod_idx}: {mod['name']}")
} continue
lectures = discover_lectures(session, mod)
# Skip if already downloaded module_entry = {
if dest.exists() and dest.stat().st_size > 1_000_000: "name": mod["name"],
lecture_entry["download_status"] = "complete" "module_id": mod["module_id"],
lecture_entry["file_size_bytes"] = dest.stat().st_size "lectures": [],
skipped += 1 }
log.info(f" Skipping (exists): {dest.name}")
else: for lec in lectures:
if download_file(session, lec["url"], dest): total_files += 1
lecture_entry["download_status"] = "complete" dest = Path(lec["audio_path"])
lecture_entry["file_size_bytes"] = dest.stat().st_size stem = dest.stem.replace(" [Audio]", "")
downloaded += 1
else: lecture_entry = {
lecture_entry["download_status"] = "failed" "title": lec["title"],
failed += 1 "original_filename": lec["original_filename"],
"url": lec["url"],
module_entry["lectures"].append(lecture_entry) "audio_path": lec["audio_path"],
"transcript_path": f"transcripts/{stem}.txt",
manifest["modules"].append(module_entry) "srt_path": f"transcripts/{stem}.srt",
# Save manifest after each module (checkpoint) "summary_path": f"summaries/{stem}_summary.md",
save_manifest(manifest) "download_status": "pending",
"transcribe_status": "pending",
# Final validation "file_size_bytes": 0,
all_ok = all( }
Path(lec["audio_path"]).exists() and Path(lec["audio_path"]).stat().st_size > 1_000_000
for mod in manifest["modules"] # Skip if already downloaded
for lec in mod["lectures"] if dest.exists() and dest.stat().st_size > 1_000_000:
if lec["download_status"] == "complete" lecture_entry["download_status"] = "complete"
) lecture_entry["file_size_bytes"] = dest.stat().st_size
skipped += 1
log.info("=" * 60) log.info(f" Skipping (exists): {dest.name}")
log.info(f"Downloaded {downloaded}/{total_files} files, {skipped} skipped, {failed} failures.") else:
log.info(f"All files > 1MB: {'PASS' if all_ok else 'FAIL'}") if download_file(session, lec["url"], dest):
log.info("=" * 60) lecture_entry["download_status"] = "complete"
lecture_entry["file_size_bytes"] = dest.stat().st_size
if failed: downloaded += 1
sys.exit(1) else:
lecture_entry["download_status"] = "failed"
failed += 1
if __name__ == "__main__":
main() module_entry["lectures"].append(lecture_entry)
manifest["modules"].append(module_entry)
# Save manifest after each module (checkpoint)
save_manifest(manifest)
# Final validation
all_ok = all(
Path(lec["audio_path"]).exists() and Path(lec["audio_path"]).stat().st_size > 1_000_000
for mod in manifest["modules"]
for lec in mod["lectures"]
if lec["download_status"] == "complete"
)
log.info("=" * 60)
log.info(f"Downloaded {downloaded}/{total_files} files, {skipped} skipped, {failed} failures.")
log.info(f"All files > 1MB: {'PASS' if all_ok else 'FAIL'}")
log.info("=" * 60)
if failed:
sys.exit(1)
if __name__ == "__main__":
main()
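
The `--modules` filter added to `download.py` accepts strings such as `4-5`, `1,3`, or `1-3,5`. As written in the diff, malformed input (for example `5-` or `a`) would surface as an uncaught `ValueError` from `int()`, and a descending range such as `5-4` would silently select nothing. The following is a minimal sketch of a stricter variant, not the author's implementation; the name `parse_module_filter_strict` and its error messages are hypothetical.

```python
def parse_module_filter_strict(arg: str) -> set[int]:
    """Parse '4-5', '1,3' or '1-3,5' into a set of 1-based module indices."""
    result: set[int] = set()
    for part in arg.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)  # ValueError if not numeric
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            result.update(range(start, end + 1))
        else:
            result.add(int(part))
    if not result:
        raise ValueError("empty module filter")
    if min(result) < 1:
        raise ValueError("module indices are 1-based")
    return result
```

If adopted, `main()` could wrap the call in a `try`/`except ValueError` and exit with a readable message instead of a traceback.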

File diff suppressed because it is too large
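
One possible side effect of the `--modules` filter in `download.py`: `main()` builds `manifest` from scratch and `save_manifest()` overwrites `manifest.json` after each processed module, while `load_manifest()` is defined but does not appear to be called in the code shown. A filtered run would therefore seem to leave only the selected modules in the manifest. The sketch below shows one way the previous entries could be preserved. `merge_modules` is a hypothetical helper, and it assumes `module_id` is unique per module.

```python
from typing import Optional


def merge_modules(existing: Optional[dict], fresh: dict) -> dict:
    """Return `fresh` plus any modules from `existing` that `fresh` does not contain."""
    if not existing:
        return fresh
    fresh_ids = {module["module_id"] for module in fresh["modules"]}
    # Keep earlier entries only for modules that were not re-processed in this run.
    kept = [
        module
        for module in existing.get("modules", [])
        if module["module_id"] not in fresh_ids
    ]
    merged = dict(fresh)
    merged["modules"] = kept + fresh["modules"]
    return merged
```

In this sketch the retained modules are placed before the freshly processed ones, so the original course order is not guaranteed. A caller would pass `load_manifest()` as `existing` just before `save_manifest()`.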


@@ -1,3 +1,3 @@
requests requests
beautifulsoup4 beautifulsoup4
python-dotenv python-dotenv

56
run.bat

@@ -2,8 +2,6 @@
 setlocal enabledelayedexpansion
 cd /d "%~dp0"
 
-:: Prevent Vulkan from exhausting VRAM — overflow to system RAM instead of crashing
-set "GGML_VK_PREFER_HOST_MEMORY=ON"
 
 echo ============================================================
 echo NLP Master - Download + Transcribe Pipeline
@@ -20,17 +18,31 @@ set "NEED_WHISPER="
 set "NEED_MODEL="
 
 :: --- Python ---
-python --version >nul 2>&1
-if errorlevel 1 (
+:: Avoid executing python.exe directly — the Microsoft Store stub terminates cmd.exe.
+:: Use 'py' launcher first (safe), then find python.exe excluding WindowsApps stub.
+set "PYTHON_CMD="
+where py >nul 2>&1
+if not errorlevel 1 (
+    set "PYTHON_CMD=py"
+    for /f "tokens=2" %%v in ('py --version 2^>^&1') do echo [OK] Python %%v (py launcher^)
+)
+if not defined PYTHON_CMD (
+    for /f "delims=" %%p in ('where python 2^>nul ^| findstr /v /i "WindowsApps"') do (
+        if not defined PYTHON_CMD (
+            set "PYTHON_CMD=%%p"
+            for /f "tokens=2" %%v in ('"%%p" --version 2^>^&1') do echo [OK] Python %%v
+        )
+    )
+)
+if not defined PYTHON_CMD (
     echo [X] Python NOT FOUND
+    echo The Microsoft Store stub does not count as a real Python install.
     echo Install from: https://www.python.org/downloads/
     echo Make sure to check "Add Python to PATH" during install.
     echo.
     echo Cannot continue without Python. Install it and re-run.
     pause
     exit /b 1
-) else (
-    for /f "tokens=2" %%v in ('python --version 2^>^&1') do echo [OK] Python %%v
 )
 
 :: --- .env credentials ---
@@ -126,21 +138,6 @@ if exist "%WHISPER_MODEL%" (
     set "NEED_MODEL=1"
 )
 
-:: --- Vulkan GPU support ---
-set "VULKAN_FOUND="
-where vulkaninfo >nul 2>&1
-if not errorlevel 1 (
-    set "VULKAN_FOUND=1"
-    echo [OK] Vulkan SDK Installed
-) else (
-    if exist "%VULKAN_SDK%\Bin\vulkaninfo.exe" (
-        set "VULKAN_FOUND=1"
-        echo [OK] Vulkan SDK %VULKAN_SDK%
-    ) else (
-        echo [!!] Vulkan SDK Not detected (whisper.cpp may use CPU fallback^)
-        echo Install from: https://vulkan.lunarg.com/sdk/home
-    )
-)
 
 :: --- Disk space ---
 echo.
@@ -173,7 +170,7 @@ if defined NEED_FFMPEG (
     echo ============================================================
     echo Auto-downloading ffmpeg...
     echo ============================================================
-    python setup_whisper.py ffmpeg
+    "!PYTHON_CMD!" setup_whisper.py ffmpeg
     if errorlevel 1 (
         echo.
         echo ERROR: Could not install ffmpeg.
@@ -193,9 +190,9 @@ if exist "ffmpeg-bin\ffmpeg.exe" (
 
 if defined NEED_WHISPER (
     echo ============================================================
-    echo Auto-downloading whisper.cpp (Vulkan build^)...
+    echo Auto-downloading whisper.cpp (CPU build^)...
     echo ============================================================
-    python setup_whisper.py whisper
+    "!PYTHON_CMD!" setup_whisper.py whisper
     if errorlevel 1 (
         echo.
         echo ERROR: Failed to auto-download whisper.cpp.
@@ -217,7 +214,7 @@ if defined NEED_MODEL (
     echo Auto-downloading Whisper model (ggml-medium-q5_0, ~500 MB^)...
     echo This will take a few minutes depending on your connection.
     echo ============================================================
-    python setup_whisper.py model
+    "!PYTHON_CMD!" setup_whisper.py model
     if errorlevel 1 (
         echo.
         echo ERROR: Failed to download model.
@@ -242,7 +239,7 @@ echo.
 :: ============================================================
 if not exist ".venv\Scripts\python.exe" (
     echo [1/4] Creating Python virtual environment...
-    python -m venv .venv
+    "!PYTHON_CMD!" -m venv .venv
     if errorlevel 1 (
         echo ERROR: Failed to create venv.
         pause
@@ -268,7 +265,12 @@ echo Done.
 echo.
 echo [3/4] Downloading audio files...
 echo ============================================================
-.venv\Scripts\python download.py
+if "%~1"=="" (
+    .venv\Scripts\python download.py
+) else (
+    echo Modules filter: %~1
+    .venv\Scripts\python download.py --modules %~1
+)
 if errorlevel 1 (
     echo.
     echo WARNING: Some downloads failed. Check download_errors.log
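
The Python detection in `run.bat` keeps the first `where python` result that does not contain `WindowsApps`, so the Microsoft Store stub is never executed. The same selection rule can be sketched in Python for illustration. `pick_real_python` is a hypothetical helper, and the sample paths in the usage note are made up.

```python
from typing import Optional


def pick_real_python(candidates: list[str]) -> Optional[str]:
    """Return the first path that does not contain 'WindowsApps' (case-insensitive)."""
    for raw in candidates:
        path = raw.strip()
        if not path:
            continue  # skip blank lines in the command output
        # Mirrors `findstr /v /i "WindowsApps"`: drop any line containing the substring.
        if "windowsapps" in path.lower():
            continue
        return path
    return None
```

For example, given `C:\Users\me\AppData\Local\Microsoft\WindowsApps\python.exe` followed by `C:\Python312\python.exe`, the helper would return the second path. Given only the first, it returns `None`, which corresponds to the "Python NOT FOUND" branch of the batch file.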


@@ -1,325 +1,300 @@
""" """
Auto-download and setup whisper.cpp (Vulkan) + model for Windows. Auto-download and setup whisper.cpp (CPU build) + model for Windows.
Called by run.bat when prerequisites are missing. Called by run.bat when prerequisites are missing.
""" """
import io import io
import json import json
import os import os
import sys import sys
import zipfile import zipfile
from pathlib import Path from pathlib import Path
from urllib.request import urlopen, Request from urllib.request import urlopen, Request
MODELS_DIR = Path("models") MODELS_DIR = Path("models")
MODEL_NAME = "ggml-medium-q5_0.bin" MODEL_NAME = "ggml-medium-q5_0.bin"
MODEL_URL = f"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/{MODEL_NAME}" MODEL_URL = f"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/{MODEL_NAME}"
GITHUB_API = "https://api.github.com/repos/ggml-org/whisper.cpp/releases/latest" GITHUB_API = "https://api.github.com/repos/ggml-org/whisper.cpp/releases/latest"
# Community Vulkan builds (for AMD GPUs) # Community Vulkan builds (for AMD GPUs)
VULKAN_BUILDS_API = "https://api.github.com/repos/jerryshell/whisper.cpp-windows-vulkan-bin/releases/latest" VULKAN_BUILDS_API = "https://api.github.com/repos/jerryshell/whisper.cpp-windows-vulkan-bin/releases/latest"
WHISPER_DIR = Path("whisper-bin") WHISPER_DIR = Path("whisper-bin")
def progress_bar(current: int, total: int, width: int = 40): def progress_bar(current: int, total: int, width: int = 40):
if total <= 0: if total <= 0:
return return
pct = current / total pct = current / total
filled = int(width * pct) filled = int(width * pct)
bar = "=" * filled + "-" * (width - filled) bar = "=" * filled + "-" * (width - filled)
mb_done = current / 1_048_576 mb_done = current / 1_048_576
mb_total = total / 1_048_576 mb_total = total / 1_048_576
print(f"\r [{bar}] {pct:.0%} {mb_done:.0f}/{mb_total:.0f} MB", end="", flush=True) print(f"\r [{bar}] {pct:.0%} {mb_done:.0f}/{mb_total:.0f} MB", end="", flush=True)
def download_file(url: str, dest: Path, desc: str): def download_file(url: str, dest: Path, desc: str):
"""Download a file with progress bar.""" """Download a file with progress bar."""
print(f"\n Downloading {desc}...") print(f"\n Downloading {desc}...")
print(f" URL: {url}") print(f" URL: {url}")
req = Request(url, headers={"User-Agent": "Mozilla/5.0"}) req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
resp = urlopen(req, timeout=60) resp = urlopen(req, timeout=60)
total = int(resp.headers.get("Content-Length", 0)) total = int(resp.headers.get("Content-Length", 0))
downloaded = 0 downloaded = 0
tmp = dest.with_suffix(".tmp") tmp = dest.with_suffix(".tmp")
with open(tmp, "wb") as f: with open(tmp, "wb") as f:
while True: while True:
chunk = resp.read(1024 * 1024) chunk = resp.read(1024 * 1024)
if not chunk: if not chunk:
break break
f.write(chunk) f.write(chunk)
downloaded += len(chunk) downloaded += len(chunk)
progress_bar(downloaded, total) progress_bar(downloaded, total)
print() # newline after progress bar print() # newline after progress bar
tmp.rename(dest) tmp.rename(dest)
print(f" Saved: {dest} ({downloaded / 1_048_576:.0f} MB)") print(f" Saved: {dest} ({downloaded / 1_048_576:.0f} MB)")
def fetch_release(api_url: str) -> dict | None: def fetch_release(api_url: str) -> dict | None:
"""Fetch a GitHub release JSON.""" """Fetch a GitHub release JSON."""
req = Request(api_url, headers={"User-Agent": "Mozilla/5.0"}) req = Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
try: try:
resp = urlopen(req, timeout=30) resp = urlopen(req, timeout=30)
return json.loads(resp.read()) return json.loads(resp.read())
except Exception as e: except Exception as e:
print(f" Could not fetch from {api_url}: {e}") print(f" Could not fetch from {api_url}: {e}")
return None return None
def extract_zip(zip_path: Path): def extract_zip(zip_path: Path):
"""Extract zip contents into WHISPER_DIR, flattened.""" """Extract zip contents into WHISPER_DIR, flattened."""
print(f"\n Extracting to {WHISPER_DIR}/...") print(f"\n Extracting to {WHISPER_DIR}/...")
WHISPER_DIR.mkdir(exist_ok=True) WHISPER_DIR.mkdir(exist_ok=True)
with zipfile.ZipFile(zip_path) as zf: with zipfile.ZipFile(zip_path) as zf:
for member in zf.namelist(): for member in zf.namelist():
filename = Path(member).name filename = Path(member).name
if not filename: if not filename:
continue continue
target = WHISPER_DIR / filename target = WHISPER_DIR / filename
with zf.open(member) as src, open(target, "wb") as dst: with zf.open(member) as src, open(target, "wb") as dst:
dst.write(src.read()) dst.write(src.read())
print(f" {filename}") print(f" {filename}")
zip_path.unlink() zip_path.unlink()
def find_whisper_exe() -> str | None: def find_whisper_exe() -> str | None:
"""Find whisper-cli.exe (or similar) in WHISPER_DIR.""" """Find whisper-cli.exe (or similar) in WHISPER_DIR."""
whisper_exe = WHISPER_DIR / "whisper-cli.exe" whisper_exe = WHISPER_DIR / "whisper-cli.exe"
if whisper_exe.exists(): if whisper_exe.exists():
return str(whisper_exe) return str(whisper_exe)
# Try main.exe (older naming) # Try main.exe (older naming)
main_exe = WHISPER_DIR / "main.exe" main_exe = WHISPER_DIR / "main.exe"
if main_exe.exists(): if main_exe.exists():
return str(main_exe) return str(main_exe)
exes = list(WHISPER_DIR.glob("*.exe")) exes = list(WHISPER_DIR.glob("*.exe"))
for exe in exes: for exe in exes:
if "whisper" in exe.name.lower() and "cli" in exe.name.lower(): if "whisper" in exe.name.lower() and "cli" in exe.name.lower():
return str(exe) return str(exe)
for exe in exes: for exe in exes:
if "whisper" in exe.name.lower(): if "whisper" in exe.name.lower():
return str(exe) return str(exe)
if exes: if exes:
return str(exes[0]) return str(exes[0])
return None return None
def try_community_vulkan_build() -> str | None: def try_community_vulkan_build() -> str | None:
"""Try downloading Vulkan build from jerryshell's community repo.""" """Try downloading Vulkan build from jerryshell's community repo."""
print("\n Trying community Vulkan build (jerryshell/whisper.cpp-windows-vulkan-bin)...") print("\n Trying community Vulkan build (jerryshell/whisper.cpp-windows-vulkan-bin)...")
release = fetch_release(VULKAN_BUILDS_API) release = fetch_release(VULKAN_BUILDS_API)
if not release: if not release:
return None return None
tag = release.get("tag_name", "unknown") tag = release.get("tag_name", "unknown")
print(f" Community release: {tag}") print(f" Community release: {tag}")
# Find a zip asset # Find a zip asset
for asset in release.get("assets", []): for asset in release.get("assets", []):
name = asset["name"].lower() name = asset["name"].lower()
if name.endswith(".zip"): if name.endswith(".zip"):
print(f" Found: {asset['name']}") print(f" Found: {asset['name']}")
zip_path = Path(asset["name"]) zip_path = Path(asset["name"])
download_file(asset["browser_download_url"], zip_path, asset["name"]) download_file(asset["browser_download_url"], zip_path, asset["name"])
extract_zip(zip_path) extract_zip(zip_path)
return find_whisper_exe() return find_whisper_exe()
print(" No zip asset found in community release") print(" No zip asset found in community release")
return None return None
def try_official_vulkan_build() -> str | None: def try_official_vulkan_build() -> str | None:
"""Try downloading Vulkan build from official ggml-org releases.""" """Try downloading Vulkan build from official ggml-org releases."""
print("\n Fetching latest whisper.cpp release from ggml-org...") print("\n Fetching latest whisper.cpp release from ggml-org...")
release = fetch_release(GITHUB_API) release = fetch_release(GITHUB_API)
if not release: if not release:
return None return None
tag = release.get("tag_name", "unknown") tag = release.get("tag_name", "unknown")
print(f" Official release: {tag}") print(f" Official release: {tag}")
# Priority: vulkan > noavx (cpu-only, no CUDA deps) > skip CUDA entirely # Priority: CPU build (no GPU deps needed)
vulkan_asset = None cpu_asset = None
cpu_asset = None for asset in release.get("assets", []):
for asset in release.get("assets", []): name = asset["name"].lower()
name = asset["name"].lower() if not name.endswith(".zip"):
if not name.endswith(".zip"): continue
continue # Must be Windows
# Must be Windows if "win" not in name and "x64" not in name:
if "win" not in name and "x64" not in name: continue
continue # Skip GPU builds entirely
# Absolutely skip CUDA builds - they won't work on AMD if "cuda" in name or "vulkan" in name or "openblas" in name:
if "cuda" in name: continue
continue if "noavx" not in name:
if "vulkan" in name: cpu_asset = asset
vulkan_asset = asset break
break
if "noavx" not in name and "openblas" not in name: if not cpu_asset:
cpu_asset = asset print(" No CPU build found in official releases")
print(" Available assets:")
chosen = vulkan_asset or cpu_asset for asset in release.get("assets", []):
if not chosen: print(f" - {asset['name']}")
print(" No Vulkan or CPU-only build found in official releases") return None
print(" Available assets:")
for asset in release.get("assets", []): print(f" Found CPU build: {cpu_asset['name']}")
print(f" - {asset['name']}") chosen = cpu_asset
return None
zip_path = Path(chosen["name"])
if vulkan_asset: download_file(chosen["browser_download_url"], zip_path, chosen["name"])
print(f" Found official Vulkan build: {chosen['name']}") extract_zip(zip_path)
else: return find_whisper_exe()
print(f" No Vulkan build in official release, using CPU build: {chosen['name']}")
print(f" (Will work but without GPU acceleration)")
def setup_whisper_bin() -> str | None:
zip_path = Path(chosen["name"]) """Download whisper.cpp CPU release. Returns path to whisper-cli.exe."""
download_file(chosen["browser_download_url"], zip_path, chosen["name"]) whisper_exe = WHISPER_DIR / "whisper-cli.exe"
extract_zip(zip_path) if whisper_exe.exists():
return find_whisper_exe() print(f" whisper-cli.exe already exists at {whisper_exe}")
return str(whisper_exe)
def setup_whisper_bin() -> str | None: exe_path = try_official_vulkan_build()
"""Download whisper.cpp Vulkan release. Returns path to whisper-cli.exe.""" if exe_path:
whisper_exe = WHISPER_DIR / "whisper-cli.exe" print(f"\n whisper-cli.exe ready at: {exe_path}")
if whisper_exe.exists(): return exe_path
# Check if it's a CUDA build (has CUDA DLLs but no Vulkan DLL)
has_cuda = (WHISPER_DIR / "ggml-cuda.dll").exists() print("\n ERROR: Could not download whisper.cpp")
has_vulkan = (WHISPER_DIR / "ggml-vulkan.dll").exists() print(" Manual install: https://github.com/ggml-org/whisper.cpp/releases")
if has_cuda and not has_vulkan: return None
print(f" WARNING: Existing install is a CUDA build (won't work on AMD GPU)")
print(f" Removing and re-downloading Vulkan build...")
import shutil FFMPEG_DIR = Path("ffmpeg-bin")
shutil.rmtree(WHISPER_DIR) FFMPEG_URL = "https://www.gyan.dev/ffmpeg/builds/ffmpeg-release-essentials.zip"
else:
print(f" whisper-cli.exe already exists at {whisper_exe}")
return str(whisper_exe) def setup_ffmpeg() -> str | None:
"""Download ffmpeg if not found. Returns path to ffmpeg.exe."""
# Strategy: try community Vulkan build first (reliable for AMD), import shutil
# then fall back to official release
exe_path = try_community_vulkan_build() # Already in PATH?
if exe_path: if shutil.which("ffmpeg"):
print(f"\n whisper-cli.exe ready at: {exe_path} (Vulkan)") path = shutil.which("ffmpeg")
return exe_path print(f" ffmpeg already in PATH: {path}")
return path
print("\n Community build failed, trying official release...")
exe_path = try_official_vulkan_build() # Already downloaded locally?
if exe_path: local_exe = FFMPEG_DIR / "ffmpeg.exe"
print(f"\n whisper-cli.exe ready at: {exe_path}") if local_exe.exists():
return exe_path print(f" ffmpeg already exists at {local_exe}")
return str(local_exe)
print("\n ERROR: Could not download whisper.cpp")
print(" Manual install: https://github.com/ggml-org/whisper.cpp/releases") print("\n Downloading ffmpeg (essentials build)...")
print(" Build from source with: cmake -DGGML_VULKAN=1") zip_path = Path("ffmpeg-essentials.zip")
return None download_file(FFMPEG_URL, zip_path, "ffmpeg")
print(f"\n Extracting ffmpeg...")
FFMPEG_DIR = Path("ffmpeg-bin") FFMPEG_DIR.mkdir(exist_ok=True)
FFMPEG_URL = "https://www.gyan.dev/ffmpeg/builds/ffmpeg-release-essentials.zip" with zipfile.ZipFile(zip_path) as zf:
for member in zf.namelist():
def setup_ffmpeg() -> str | None:
    """Download ffmpeg if not found. Returns path to ffmpeg.exe."""
    import shutil

    # Already in PATH?
    if shutil.which("ffmpeg"):
        path = shutil.which("ffmpeg")
        print(f" ffmpeg already in PATH: {path}")
        return path

    # Already downloaded locally?
    local_exe = FFMPEG_DIR / "ffmpeg.exe"
    if local_exe.exists():
        print(f" ffmpeg already exists at {local_exe}")
        return str(local_exe)

    print("\n Downloading ffmpeg (essentials build)...")
    zip_path = Path("ffmpeg-essentials.zip")
    download_file(FFMPEG_URL, zip_path, "ffmpeg")

    print(f"\n Extracting ffmpeg...")
    FFMPEG_DIR.mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # Only extract the bin/*.exe files
            if member.endswith(".exe"):
                filename = Path(member).name
                target = FFMPEG_DIR / filename
                with zf.open(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                print(f" {filename}")

    zip_path.unlink()

    if local_exe.exists():
        print(f"\n ffmpeg ready at: {local_exe}")
        return str(local_exe)

    print(" ERROR: ffmpeg.exe not found after extraction")
    return None

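Review note: the extraction loop in `setup_ffmpeg` loads each executable fully into memory with `src.read()`. Below is a minimal sketch of a streaming variant, assuming the same flat layout (every `.exe` member copied into one directory). The name `extract_executables` and its parameters are hypothetical and not part of this commit.

```python
import shutil
import zipfile
from pathlib import Path


def extract_executables(zip_path: Path, dest_dir: Path) -> list[Path]:
    """Copy every *.exe member of zip_path into dest_dir, flattening paths."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            if not member.endswith(".exe"):
                continue
            target = dest_dir / Path(member).name
            # copyfileobj streams in blocks instead of reading the whole member at once
            with zf.open(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Returning the list of extracted paths would also let the caller check for `ffmpeg.exe` without touching the filesystem a second time.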
def setup_model() -> bool:
    """Download whisper model. Returns True on success."""
    MODELS_DIR.mkdir(exist_ok=True)
    model_path = MODELS_DIR / MODEL_NAME

    if model_path.exists() and model_path.stat().st_size > 100_000_000:
        print(f" Model already exists: {model_path} ({model_path.stat().st_size / 1_048_576:.0f} MB)")
        return True

    download_file(MODEL_URL, model_path, f"Whisper model ({MODEL_NAME})")

    if model_path.exists() and model_path.stat().st_size > 100_000_000:
        return True

    print(" ERROR: Model file too small or missing after download")
    return False

def main():
    what = sys.argv[1] if len(sys.argv) > 1 else "all"

    if what in ("all", "ffmpeg"):
        print("=" * 60)
        print(" Setting up ffmpeg")
        print("=" * 60)
        ffmpeg_path = setup_ffmpeg()
        if ffmpeg_path:
            Path(".ffmpeg_bin_path").write_text(ffmpeg_path)
        else:
            print("\nFAILED to set up ffmpeg")
            if what == "ffmpeg":
                sys.exit(1)

    if what in ("all", "whisper"):
        print("=" * 60)
        print(" Setting up whisper.cpp")
        print("=" * 60)
        exe_path = setup_whisper_bin()
        if exe_path:
            # Write path to temp file so run.bat can read it
            Path(".whisper_bin_path").write_text(exe_path)
        else:
            print("\nFAILED to set up whisper.cpp")
            if what == "whisper":
                sys.exit(1)

    if what in ("all", "model"):
        print()
        print("=" * 60)
        print(f" Downloading Whisper model: {MODEL_NAME}")
        print("=" * 60)
        if not setup_model():
            print("\nFAILED to download model")
            sys.exit(1)

    print()
    print("Setup complete!")


if __name__ == "__main__":
    main()


@@ -1,192 +1,192 @@
"""
Generate summaries from transcripts using Claude Code.
Reads manifest.json, processes each transcript, outputs per-lecture summaries,
and compiles SUPORT_CURS.md master study guide.
Usage:
    python summarize.py            # Print prompts for each transcript (pipe to Claude)
    python summarize.py --compile  # Compile existing summaries into SUPORT_CURS.md
"""
import json
import sys
import textwrap
from pathlib import Path

MANIFEST_PATH = Path("manifest.json")
SUMMARIES_DIR = Path("summaries")
TRANSCRIPTS_DIR = Path("transcripts")
MASTER_GUIDE = Path("SUPORT_CURS.md")
MAX_WORDS_PER_CHUNK = 10000
OVERLAP_WORDS = 500

SUMMARY_PROMPT = """Rezuma aceasta transcriere a unei lectii din cursul NLP Master Practitioner.
Ofera:
1. **Prezentare generala** - 3-5 propozitii care descriu subiectul principal al lectiei
2. **Concepte cheie** - lista cu definitii scurte pentru fiecare concept important
3. **Detalii si exemple importante** - informatii concrete, exercitii practice, exemple relevante mentionate de trainer
4. **Citate memorabile** - fraze sau idei remarcabile (daca exista)
Raspunde in limba romana. Formateaza ca Markdown.
---
TITLU LECTIE: {title}
---
TRANSCRIERE:
{text}
"""

MERGE_PROMPT = """Am mai multe rezumate partiale ale aceleiasi lectii (a fost prea lunga pentru un singur rezumat).
Combina-le intr-un singur rezumat coerent, eliminand duplicatele.
Pastreaza structura:
1. Prezentare generala (3-5 propozitii)
2. Concepte cheie cu definitii
3. Detalii si exemple importante
4. Citate memorabile
Raspunde in limba romana. Formateaza ca Markdown.
---
TITLU LECTIE: {title}
---
REZUMATE PARTIALE:
{chunks}
"""


def load_manifest() -> dict:
    with open(MANIFEST_PATH, encoding="utf-8") as f:
        return json.load(f)


def split_at_sentences(text: str, max_words: int, overlap: int) -> list[str]:
    """Split text into chunks at sentence boundaries with overlap."""
    words = text.split()
    if len(words) <= max_words:
        return [text]

    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunk_words = words[start:end]
        chunk_text = " ".join(chunk_words)

        # Try to break at sentence boundary (look back from end)
        if end < len(words):
            for sep in [". ", "! ", "? ", ".\n", "!\n", "?\n"]:
                last_sep = chunk_text.rfind(sep)
                if last_sep > len(chunk_text) // 2:  # Don't break too early
                    chunk_text = chunk_text[:last_sep + 1]
                    # Recalculate end based on actual words used
                    end = start + len(chunk_text.split())
                    break

        chunks.append(chunk_text)
        # Stop once the final chunk reaches the end of the text; otherwise the
        # overlap step below keeps emitting ever-shorter tail chunks.
        if end >= len(words):
            break
        start = max(end - overlap, start + 1)  # Overlap, but always advance

    return chunks
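
Review note: the overlap arithmetic in `split_at_sentences` can be checked in isolation. This is a simplified sketch that ignores sentence boundaries and works on word counts only. `window_bounds` is a hypothetical helper, not part of the commit, and it stops as soon as a window reaches the last word.

```python
def window_bounds(n_words: int, max_words: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) word indices for overlapping windows covering n_words."""
    bounds = []
    start = 0
    while start < n_words:
        end = min(start + max_words, n_words)
        bounds.append((start, end))
        if end >= n_words:
            break  # the last window already reached the end of the text
        start = max(end - overlap, start + 1)  # step back by the overlap, but always advance
    return bounds


print(window_bounds(25_000, 10_000, 500))
# [(0, 10000), (9500, 19500), (19000, 25000)]
```

With the defaults used here (`MAX_WORDS_PER_CHUNK = 10000`, `OVERLAP_WORDS = 500`), a 25,000-word transcript would therefore yield three chunks, each sharing 500 words with its neighbour.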


def generate_prompts(manifest: dict):
    """Print summary prompts for each transcript to stdout."""
    SUMMARIES_DIR.mkdir(exist_ok=True)

    for mod in manifest["modules"]:
        for lec in mod["lectures"]:
            if lec.get("transcribe_status") != "complete":
                continue

            summary_path = Path(lec["summary_path"])
            if summary_path.exists() and summary_path.stat().st_size > 0:
                print(f"# SKIP (exists): {lec['title']}", file=sys.stderr)
                continue

            txt_path = Path(lec["transcript_path"])
            if not txt_path.exists():
                print(f"# SKIP (no transcript): {lec['title']}", file=sys.stderr)
                continue

            text = txt_path.read_text(encoding="utf-8").strip()
            if not text:
                print(f"# SKIP (empty): {lec['title']}", file=sys.stderr)
                continue

            chunks = split_at_sentences(text, MAX_WORDS_PER_CHUNK, OVERLAP_WORDS)

            print(f"\n{'='*60}", file=sys.stderr)
            print(f"Lecture: {lec['title']}", file=sys.stderr)
            print(f"Words: {len(text.split())}, Chunks: {len(chunks)}", file=sys.stderr)
            print(f"Output: {summary_path}", file=sys.stderr)

            if len(chunks) == 1:
                prompt = SUMMARY_PROMPT.format(title=lec["title"], text=text)
                print(f"SUMMARY_FILE:{summary_path}")
                print(prompt)
                print("---END_PROMPT---")
            else:
                # Multi-chunk: generate individual chunk prompts
                for i, chunk in enumerate(chunks, 1):
                    prompt = SUMMARY_PROMPT.format(
                        title=f"{lec['title']} (partea {i}/{len(chunks)})",
                        text=chunk,
                    )
                    print(f"CHUNK_PROMPT:{i}/{len(chunks)}:{summary_path}")
                    print(prompt)
                    print("---END_PROMPT---")

                # Then a merge prompt
                print(f"MERGE_FILE:{summary_path}")
                merge = MERGE_PROMPT.format(
                    title=lec["title"],
                    chunks="{chunk_summaries}",  # Placeholder for merge step
                )
                print(merge)
                print("---END_PROMPT---")
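
Review note: `generate_prompts` writes a line-oriented stream to stdout. Each record is a header line (`SUMMARY_FILE:`, `CHUNK_PROMPT:` or `MERGE_FILE:`), then the prompt body, then `---END_PROMPT---`. Below is a minimal sketch of a consumer that splits that stream back into records. `parse_prompt_stream` and `PromptRecord` are hypothetical names, and the sketch assumes no transcript line is exactly the end marker.

```python
from dataclasses import dataclass

END_MARKER = "---END_PROMPT---"
HEADERS = ("SUMMARY_FILE:", "CHUNK_PROMPT:", "MERGE_FILE:")


@dataclass
class PromptRecord:
    kind: str    # "SUMMARY_FILE", "CHUNK_PROMPT" or "MERGE_FILE"
    header: str  # text after the first colon, e.g. "1/3:summaries/x.md"
    prompt: str  # prompt body without the end marker


def parse_prompt_stream(stream_text: str) -> list[PromptRecord]:
    """Split the stdout of generate_prompts into individual prompt records."""
    records = []
    kind = header = None
    body = []
    for line in stream_text.splitlines():
        if kind is None:
            # Outside a record: only a header line opens a new one
            if line.startswith(HEADERS):
                kind, header = line.split(":", 1)
                body = []
            continue
        if line == END_MARKER:
            records.append(PromptRecord(kind, header, "\n".join(body)))
            kind = header = None
        else:
            body.append(line)
    return records
```

Because the summary path comes last in the header, splitting only on the first colon keeps Windows paths that contain a drive letter intact.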


def compile_master_guide(manifest: dict):
    """Compile all summaries into SUPORT_CURS.md."""
    lines = [
        "# SUPORT CURS - NLP Master Practitioner Bucuresti 2025\n",
        "_Generat automat din transcrierile audio ale cursului._\n",
        "---\n",
    ]

    for mod in manifest["modules"]:
        lines.append(f"\n## {mod['name']}\n")
        for lec in mod["lectures"]:
            summary_path = Path(lec["summary_path"])
            lines.append(f"\n### {lec['title']}\n")
            if summary_path.exists():
                content = summary_path.read_text(encoding="utf-8").strip()
                lines.append(f"{content}\n")
            else:
                lines.append("_Rezumat indisponibil._\n")
        lines.append("\n---\n")

    MASTER_GUIDE.write_text("\n".join(lines), encoding="utf-8")
    print(f"Compiled {MASTER_GUIDE} ({MASTER_GUIDE.stat().st_size} bytes)")


def main():
    if not MANIFEST_PATH.exists():
        print("manifest.json not found. Run download.py and transcribe.py first.")
        sys.exit(1)

    manifest = load_manifest()

    if "--compile" in sys.argv:
        compile_master_guide(manifest)
    else:
        generate_prompts(manifest)


if __name__ == "__main__":
    main()


@@ -1,299 +1,299 @@
"""
Batch transcription using whisper.cpp.
Reads manifest.json, transcribes each audio file in module order,
outputs .txt and .srt files, updates manifest status.
Resumable: skips files with existing transcripts.
Converts MP3 -> WAV (16kHz mono) via ffmpeg before transcription.
"""
import json
import logging
import os
import shutil
import subprocess
import sys
from pathlib import Path

MANIFEST_PATH = Path("manifest.json")
TRANSCRIPTS_DIR = Path("transcripts")
WAV_CACHE_DIR = Path("audio_wav")

# whisper.cpp defaults — override with env vars or CLI args
WHISPER_BIN = os.getenv("WHISPER_BIN", r"whisper-cli.exe")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", r"models\ggml-medium-q5_0.bin")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("transcribe_errors.log"),
    ],
)
log = logging.getLogger(__name__)


def find_ffmpeg() -> str:
    """Find ffmpeg executable."""
    if shutil.which("ffmpeg"):
        return "ffmpeg"
    # Check local directories
    for p in [Path("ffmpeg.exe"), Path("ffmpeg-bin/ffmpeg.exe")]:
        if p.exists():
            return str(p.resolve())
    # Try imageio-ffmpeg (pip fallback)
    try:
        import imageio_ffmpeg
        return imageio_ffmpeg.get_ffmpeg_exe()
    except ImportError:
        pass
    return ""


def convert_to_wav(audio_path: str) -> str:
    """
    Convert audio file to WAV 16kHz mono (optimal for whisper.cpp).
    Returns path to WAV file. Skips if WAV already exists.
    """
    src = Path(audio_path)

    # Already a WAV file, skip
    if src.suffix.lower() == ".wav":
        return audio_path

    WAV_CACHE_DIR.mkdir(exist_ok=True)
    wav_path = WAV_CACHE_DIR / (src.stem + ".wav")

    # Skip if already converted
    if wav_path.exists() and wav_path.stat().st_size > 0:
        log.info(f" WAV cache hit: {wav_path}")
        return str(wav_path)

    ffmpeg = find_ffmpeg()
    if not ffmpeg:
        log.warning(" ffmpeg not found, using original file (may cause bad transcription)")
        return audio_path

    log.info(f" Converting to WAV: {src.name} -> {wav_path.name}")
    cmd = [
        ffmpeg,
        "-i", audio_path,
        "-vn",                   # no video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16kHz sample rate (whisper standard)
        "-ac", "1",              # mono
        "-y",                    # overwrite
        str(wav_path),
    ]
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=300,  # 5 min max for conversion
        )
        if result.returncode != 0:
            log.error(f" ffmpeg failed: {result.stderr[:300]}")
            return audio_path
        log.info(f" WAV ready: {wav_path.name} ({wav_path.stat().st_size / 1_048_576:.0f} MB)")
        return str(wav_path)
    except FileNotFoundError:
        log.warning(f" ffmpeg not found at: {ffmpeg}")
        return audio_path
    except subprocess.TimeoutExpired:
        log.error(f" ffmpeg conversion timeout for {audio_path}")
        return audio_path


def load_manifest() -> dict:
    with open(MANIFEST_PATH, encoding="utf-8") as f:
        return json.load(f)


def save_manifest(manifest: dict):
    with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, ensure_ascii=False)
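
Review note: the manifest is rewritten after every file as a checkpoint, so an interruption during `json.dump` could leave a truncated `manifest.json`. One possible hardening is sketched below: write to a temporary file in the same directory and swap it into place with `os.replace`. `save_manifest_atomic` is a hypothetical helper, not part of this commit.

```python
import json
import os
import tempfile
from pathlib import Path


def save_manifest_atomic(manifest: dict, path: Path) -> None:
    """Write manifest as JSON to a temp file, then swap it into place."""
    directory = path.resolve().parent
    # Same directory as the target, so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2, ensure_ascii=False)
        os.replace(tmp_name, path)  # overwrites the target in a single rename call
    except BaseException:
        # Remove the temp file if the dump or the swap failed
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader of the manifest would then see either the previous complete file or the new complete file, never a half-written one.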


def transcribe_file(audio_path: str, output_base: str) -> bool:
    """
    Run whisper.cpp on a single file.
    Returns True on success.
    """
    cmd = [
        WHISPER_BIN,
        "--model", WHISPER_MODEL,
        "--language", "ro",
        "--no-gpu",
        "--threads", str(os.cpu_count() or 4),
        "--beam-size", "1",
        "--best-of", "1",
        "--output-txt",
        "--output-srt",
        "--output-file", output_base,
        "--file", audio_path,
    ]
    log.info(f" CMD: {' '.join(cmd)}")

    try:
        # Add whisper.exe's directory to PATH so Windows finds its DLLs
        env = os.environ.copy()
        whisper_dir = str(Path(WHISPER_BIN).resolve().parent)
        env["PATH"] = whisper_dir + os.pathsep + env.get("PATH", "")

        result = subprocess.run(
            cmd,
            stdout=sys.stdout,
            stderr=sys.stderr,
            timeout=7200,  # 2 hour timeout per file
            env=env,
        )
        if result.returncode != 0:
            log.error(f" whisper.cpp failed (exit {result.returncode})")
            return False

        # Verify output exists and is non-empty
        txt_path = Path(f"{output_base}.txt")
        srt_path = Path(f"{output_base}.srt")
        if not txt_path.exists() or txt_path.stat().st_size == 0:
            log.error(f" Empty or missing transcript: {txt_path}")
            return False

        log.info(f" Output: {txt_path.name} ({txt_path.stat().st_size} bytes)")
        if srt_path.exists():
            log.info(f" Output: {srt_path.name} ({srt_path.stat().st_size} bytes)")
        return True
    except subprocess.TimeoutExpired:
        log.error(f" Timeout (>2h) for {audio_path}")
        return False
    except FileNotFoundError:
        log.error(f" whisper.cpp not found at: {WHISPER_BIN}")
        log.error(f" Set WHISPER_BIN env var or put whisper-cli.exe in PATH")
        return False
    except Exception as e:
        log.error(f" Error: {e}")
        return False


def parse_module_filter(arg: str) -> set[int]:
    """Parse module filter like '1-3' or '4,5' or '1-3,5' into a set of 1-based indices."""
    result = set()
    for part in arg.split(","):
        part = part.strip()
        if "-" in part:
            a, b = part.split("-", 1)
            result.update(range(int(a), int(b) + 1))
        else:
            result.add(int(part))
    return result
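
Review note: for reference, this is how the accepted `--modules` syntax behaves. The function body is repeated from `parse_module_filter` above so the snippet runs on its own. The `assert` lines are illustrative additions, not part of the commit.

```python
def parse_module_filter(arg: str) -> set[int]:
    """Parse module filter like '1-3' or '4,5' or '1-3,5' into a set of 1-based indices."""
    result = set()
    for part in arg.split(","):
        part = part.strip()
        if "-" in part:
            a, b = part.split("-", 1)
            result.update(range(int(a), int(b) + 1))
        else:
            result.add(int(part))
    return result


assert parse_module_filter("4-5") == {4, 5}
assert parse_module_filter("1,3") == {1, 3}
assert parse_module_filter("1-3,5") == {1, 2, 3, 5}
assert parse_module_filter("5-3") == set()  # a reversed range selects nothing
```

Malformed input such as `"a-b"` or a trailing comma (`"1,"`) raises `ValueError` from `int()`, and `main()` does not catch it, so a bad filter would stop the run with a traceback.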


def main():
    if not MANIFEST_PATH.exists():
        log.error("manifest.json not found. Run download.py first.")
        sys.exit(1)

    # Parse --modules filter
    module_filter = None
    if "--modules" in sys.argv:
        idx = sys.argv.index("--modules")
        if idx + 1 < len(sys.argv):
            module_filter = parse_module_filter(sys.argv[idx + 1])
            log.info(f"Module filter: {sorted(module_filter)}")

    manifest = load_manifest()
    TRANSCRIPTS_DIR.mkdir(exist_ok=True)

    total = 0
    transcribed = 0
    skipped = 0
    failed = 0

    for mod_idx, mod in enumerate(manifest["modules"], 1):
        if module_filter and mod_idx not in module_filter:
            log.info(f"\nSkipping module {mod_idx}: {mod['name']}")
            continue

        log.info(f"\n{'='*60}")
        log.info(f"Module: {mod['name']}")
        log.info(f"{'='*60}")

        for lec in mod["lectures"]:
            total += 1
            if lec.get("download_status") != "complete":
                log.warning(f" Skipping (not downloaded): {lec['title']}")
                continue

            audio_path = lec["audio_path"]
            stem = Path(lec["original_filename"]).stem.replace(" [Audio]", "")
            output_base = str(TRANSCRIPTS_DIR / stem)

            # Check if already transcribed
            txt_path = Path(f"{output_base}.txt")
            if txt_path.exists() and txt_path.stat().st_size > 0:
                lec["transcribe_status"] = "complete"
                skipped += 1
                log.info(f" Skipping (exists): {stem}.txt")
                continue

            log.info(f" Transcribing: {lec['title']}")
            log.info(f" File: {audio_path}")

            # Convert MP3 -> WAV 16kHz mono for reliable whisper.cpp input
            wav_path = convert_to_wav(audio_path)

            if transcribe_file(wav_path, output_base):
                lec["transcribe_status"] = "complete"
                transcribed += 1
            else:
                lec["transcribe_status"] = "failed"
                failed += 1

            # Save manifest after each file (checkpoint)
            save_manifest(manifest)

        # Quality gate: pause after first module
        if mod == manifest["modules"][0] and transcribed > 0:
            log.info("\n" + "!" * 60)
            log.info("QUALITY GATE: First module complete.")
            log.info("Spot-check 2-3 transcripts in transcripts/ before continuing.")
            log.info("Press Enter to continue, or Ctrl+C to abort...")
            log.info("!" * 60)
            try:
                input()
            except EOFError:
                pass  # Non-interactive mode, continue

    # Validation
    empty_outputs = [
        lec["title"]
        for mod in manifest["modules"]
        for lec in mod["lectures"]
        if lec.get("transcribe_status") == "complete"
        and not Path(lec["transcript_path"]).exists()
    ]

    log.info("\n" + "=" * 60)
    log.info(f"Transcribed {transcribed}/{total} files, {skipped} skipped, {failed} failures.")
    log.info(f"No empty outputs: {'PASS' if not empty_outputs else 'FAIL'}")
    if empty_outputs:
        for t in empty_outputs:
            log.error(f" Missing transcript: {t}")
    log.info("=" * 60)

    save_manifest(manifest)
    if failed:
        sys.exit(1)


if __name__ == "__main__":
    main()