chore: normalize line endings from CRLF to LF across all files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
.gitignore (vendored): 68 lines changed
@@ -1,34 +1,34 @@
# Audio files
audio/
*.mp3
*.wav

# Whisper models
models/
*.bin

# Credentials
.env

# Transcripts and summaries (large generated content)
transcripts/
summaries/

# Binaries (downloaded by setup_whisper.py)
whisper-bin/
ffmpeg-bin/

# Temp files
.whisper_bin_path
.ffmpeg_bin_path

# WAV cache (converted from MP3)
audio_wav/

# Python
__pycache__/
*.pyc
.venv/

# Logs
*.log
PLAN.md: 486 lines changed
@@ -1,243 +1,243 @@
# Design: NLP Master Course Audio Pipeline

Generated by /office-hours on 2026-03-23
Branch: unknown
Repo: nlp-master (local, no git)
Status: APPROVED
Mode: Builder

## Problem Statement

Marius has an NLP master course hosted at cursuri.aresens.ro/curs/26 with 35 audio recordings (5 modules x 7 lectures, ~95 minutes each, ~58 hours total) in Romanian. The audio is behind a password-protected website. He wants to download all audio files, transcribe them offline using his AMD Radeon RX 6600M 8GB GPU, and generate clean transcripts with per-lecture summaries as study materials.

## What Makes This Cool

58 hours of Romanian lecture audio turned into searchable, summarized study materials — completely automated. Download once, let transcription run for a day, summarize with Claude Code. A pipeline that would take weeks of manual work happens in hours.

## Constraints

- **Hardware:** AMD Radeon RX 6600M 8GB (RDNA2) — no CUDA, needs Vulkan or ROCm
- **Language:** Romanian audio — Whisper large-v3 has decent but not perfect Romanian support (~95% accuracy on clean audio)
- **Source:** Password-protected website at cursuri.aresens.ro/curs/26
- **Scale:** ~35 MP3 files, ~95 min each, ~58 hours total
- **Privacy:** Course content is for personal study use only
- **Tooling:** Claude Code available for summary generation (no separate API cost)
- **Platform:** Native Windows (Python + whisper.cpp + Vulkan). Claude Code runs from WSL2 for summaries.
- **Summaries language:** Romanian (matching source material)
- **Audio format:** MP3, 320kbps, 48kHz stereo, ~218MB per file (verified from sample: "Master 25M1 Z1A [Audio].mp3")

## Premises

1. Legitimate access to the course — downloading audio for personal study is within usage rights
2. whisper.cpp with Vulkan backend is the right tool for RX 6600M (avoids ROCm compatibility issues on RDNA2)
3. Audio quality is decent (recorded lectures) — Whisper large-v3 will produce usable Romanian transcripts
4. Summaries will be generated by Claude Code after transcription — separate step
5. Batch pipeline (download all → transcribe all → summarize all) is preferred over incremental processing

## Approaches Considered

### Approach A: Full Pipeline (CHOSEN)
Python script for website login + MP3 download. Shell script for whisper.cpp batch transcription (Vulkan, large-v3-q5_0). Claude Code for per-lecture summaries from transcripts.
- Effort: M (human: ~2 days / CC: ~30 min to build, ~8 hours to run transcription)
- Risk: Low
- Pros: Complete automation, reproducible for module 6, best quality
- Cons: whisper.cpp Vulkan build requires system setup

### Approach B: Download + Transcribe Only
Same download + transcription, no automated summaries. Simpler but defers the valuable part.
- Effort: S (human: ~1 day / CC: ~20 min)
- Risk: Low

### Approach C: Fully Offline (Local LLM summaries)
Everything offline including summaries via llama.cpp. Zero external costs but lower summary quality.
- Effort: M (human: ~2 days / CC: ~40 min)
- Risk: Medium (8GB VRAM shared between whisper.cpp and llama.cpp)

## Recommended Approach

**Approach A: Full Pipeline** — Download → whisper.cpp/Vulkan → Claude Code summaries.

**Execution model:** Everything runs on native Windows (Python, whisper.cpp). Claude Code runs from WSL2 for the summary step.

### Step 0: Project Setup
- Initialize git repo with `.gitignore` (exclude: `audio/`, `models/`, `.env`, `*.mp3`, `*.wav`, `*.bin`)
- Install Python on Windows (if not already)
- Install Vulkan SDK on Windows
- Create `.env` with course credentials (never committed)

### Step 1: Site Recon + Download Audio Files
- **First:** Browse cursuri.aresens.ro/curs/26 to understand page structure (login form, module layout, MP3 link format)
- Based on recon, write `download.py` using the right scraping approach (requests+BS4 for static, playwright for JS-rendered — don't build both)
- Login with credentials from `.env` or interactive prompt
- Discover all modules dynamically (don't hardcode 5x7 — actual count may vary)
- Preserve original file names (e.g., "Master 25M1 Z1A [Audio].mp3") and extract lecture titles
- Write `manifest.json` mapping each file to: module, lecture title, original URL, file path, download status
- **Resumability:** skip already-downloaded files (check existence + file size). Retry 3x with backoff. Log to `download_errors.log`.
- **Validation:** after download completes, print summary: "Downloaded X/Y files, Z failures. All files > 1MB: pass/fail."
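The resumability and retry bullets above can be sketched as a small helper. This is a minimal sketch, assuming a generic `fetch(url) -> bytes` callable supplied by whichever scraping approach the recon picks; the function name, the backoff schedule, and the 1 MB threshold mirror the bullets and are illustrative, not final.

```python
import time
from pathlib import Path

MIN_VALID_SIZE = 1_000_000  # the "All files > 1MB" validation threshold

def download_with_retry(fetch, url: str, dest: Path,
                        retries: int = 3,
                        backoff=(5, 15, 30)) -> bool:
    """Skip files that already exist and look complete; retry with backoff."""
    if dest.exists() and dest.stat().st_size > MIN_VALID_SIZE:
        return True  # already downloaded, skip (resumability)
    for attempt in range(retries):
        try:
            data = fetch(url)      # placeholder for the real HTTP call
            dest.write_bytes(data)
            return True
        except OSError:
            if attempt < retries - 1:
                time.sleep(backoff[attempt])
    return False                   # caller logs to download_errors.log
```

Injecting `fetch` keeps the retry/skip logic testable without picking requests vs. playwright before the recon happens.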

### Step 2: Install whisper.cpp with Vulkan (Windows native)
- Option A: Download pre-built Windows binary with Vulkan from [whisper.cpp-windows-vulkan-bin](https://github.com/jerryshell/whisper.cpp-windows-vulkan-bin)
- Option B: Build from source with Visual Studio + `-DGGML_VULKAN=1` CMake flag
- Download model: `ggml-large-v3-q5_0.bin` (~1.5GB) from Hugging Face into `models/`
- **VRAM test:** transcribe a 2-min clip from the first lecture to verify GPU detection, measure speed, and validate MP3 input works. If MP3 fails (whisper.cpp built without ffmpeg libs), install ffmpeg or pre-convert with Python pydub.
- **Speed calibration:** RX 6600M is roughly half the speed of RX 9070 XT. Realistic estimate: **3-5x realtime** (~18-30 min per 90-min file). Total: **~12-18 hours** for all files. Plan for a full day, not overnight.
- **Fallback:** if large-v3-q5_0 OOMs on 8GB, try `ggml-large-v3-q4_0.bin` or `ggml-medium-q5_0.bin`.
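The calibration above is simple arithmetic; making it explicit lets you re-run it once the Step 2 VRAM test yields a measured realtime factor instead of the assumed 3-5x:

```python
def transcription_estimate(n_files: int, minutes_per_file: float,
                           realtime_factor: float) -> tuple[float, float]:
    """At N-x realtime, a clip takes (duration / N) minutes of wall clock.
    Returns (minutes per file, total hours for the batch)."""
    per_file = minutes_per_file / realtime_factor
    total_hours = per_file * n_files / 60
    return per_file, total_hours
```

With 35 files at ~95 minutes each, 3x realtime gives roughly 32 min/file and ~18.5 hours total, 5x gives 19 min/file and ~11 hours, which matches the quoted 12-18 hour planning band.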

### Step 3: Batch Transcription
- `transcribe.py` (Python, cross-platform) reads `manifest.json`, processes files in module order
- Calls whisper.cpp with: `--language ro --model models\ggml-large-v3-q5_0.bin --output-txt --output-srt`
- Output .txt and .srt per file to `transcripts/{original_name_without_ext}/`
- Updates `manifest.json` with transcription status per file
- **Resumability:** skip files with existing .txt output. Log failures to `transcribe_errors.log`.
- **Quality gate:** after first module completes (~2 hours), STOP and spot-check 2-3 transcripts. If Romanian accuracy is poor (lots of garbled text), consider: switching to `large-v3` unquantized, adjusting `--beam-size`, or accepting lower quality.
- **Validation:** print summary: "Transcribed X/Y files. Z failures. No empty outputs: pass/fail."
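The invocation and skip logic might look like the sketch below. The binary name (`whisper-cli`) is a placeholder that depends on which Step 2 install option is used; the flags are the ones listed above.

```python
from pathlib import Path

MODEL_PATH = Path("models") / "ggml-large-v3-q5_0.bin"

def build_whisper_cmd(audio: Path, out_dir: Path,
                      binary: str = "whisper-cli") -> list[str]:
    """Assemble the whisper.cpp call; pass the result to subprocess.run()."""
    return [
        binary,
        "--language", "ro",
        "--model", str(MODEL_PATH),
        "--output-txt",
        "--output-srt",
        "--output-file", str(out_dir / audio.stem),  # base name for .txt/.srt
        str(audio),
    ]

def needs_transcription(audio: Path, out_dir: Path) -> bool:
    """Resumability: skip files whose .txt output already exists."""
    return not (out_dir / f"{audio.stem}.txt").exists()
```

Keeping command construction separate from execution makes the skip logic and flag set easy to unit-test before committing the GPU to an 18-hour batch.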

### Step 4: Summary Generation with Claude Code
- From WSL2, use Claude Code to process each transcript
- Use a Python script (`summarize.py`) that reads `manifest.json`, opens each .txt file, and prints the summary prompt for Claude Code
- Summary prompt (Romanian): "Rezuma aceasta transcriere. Ofera: (1) prezentare generala in 3-5 propozitii, (2) concepte cheie cu definitii, (3) detalii si exemple importante" ("Summarize this transcript. Provide: (1) a 3-5 sentence overview, (2) key concepts with definitions, (3) important details and examples")
- **Chunking:** split transcripts > 10K words at sentence boundaries (not raw word count) with 500-word overlap. Summarize chunks, then merge.
- Output to `summaries/{original_name}_summary.md`
- Final: compile `SUPORT_CURS.md` — master study guide with lecture titles as headings
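The chunking rule can be sketched as a minimal splitter that breaks on end-of-sentence punctuation and carries a fixed word overlap between consecutive chunks. Whisper transcripts do emit punctuation, but the boundary regex here is an assumption to be checked against real output:

```python
import re

def chunk_transcript(text: str, max_words: int = 10_000,
                     overlap: int = 500) -> list[str]:
    """Split at sentence boundaries, keeping ~`overlap` words of shared
    context between consecutive chunks so no idea is cut mid-thought."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # carry the trailing `overlap` words forward as context
            tail = " ".join(current).split()[-overlap:]
            current, count = [" ".join(tail)], len(tail)
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_words` would still produce an oversized chunk; at 10K words per chunk that edge case is unlikely in lecture speech.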

### Manifest Schema
```json
{
  "course": "NLP Master 2025",
  "source_url": "https://cursuri.aresens.ro/curs/26",
  "modules": [
    {
      "name": "Modul 1",
      "lectures": [
        {
          "title": "Master 25M1 Z1A",
          "original_filename": "Master 25M1 Z1A [Audio].mp3",
          "url": "https://...",
          "audio_path": "audio/Master 25M1 Z1A [Audio].mp3",
          "transcript_path": "transcripts/Master 25M1 Z1A.txt",
          "srt_path": "transcripts/Master 25M1 Z1A.srt",
          "summary_path": "summaries/Master 25M1 Z1A_summary.md",
          "download_status": "complete",
          "transcribe_status": "pending",
          "file_size_bytes": 228486429
        }
      ]
    }
  ]
}
```
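Since all three scripts read and write this file, a shared helper keeps status updates consistent. A sketch, assuming the field names in the schema above:

```python
import json
from pathlib import Path

def set_status(manifest_path: Path, filename: str,
               field: str, value: str) -> None:
    """Find the lecture by original_filename and update one status field.
    Field names (download_status, transcribe_status, ...) follow the schema."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    for module in manifest["modules"]:
        for lecture in module["lectures"]:
            if lecture["original_filename"] == filename:
                lecture[field] = value
    manifest_path.write_text(
        json.dumps(manifest, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
```

Linear scan is fine at ~35 lectures; `ensure_ascii=False` keeps any Romanian diacritics in titles readable in the file.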

### Directory Structure
```
nlp-master/
  .gitignore          # Excludes audio/, models/, .env
  .env                # Course credentials (not committed)
  manifest.json       # Shared metadata for all scripts
  download.py         # Step 1: site recon + download
  transcribe.py       # Step 3: batch transcription
  summarize.py        # Step 4: summary generation helper
  audio/
    Master 25M1 Z1A [Audio].mp3
    Master 25M1 Z1B [Audio].mp3
    ...
  models/
    ggml-large-v3-q5_0.bin
  transcripts/
    Master 25M1 Z1A.txt
    Master 25M1 Z1A.srt
    ...
  summaries/
    Master 25M1 Z1A_summary.md
    ...
  SUPORT_CURS.md
```

## Open Questions

1. ~~What is the exact website structure?~~ Resolved: browse site first in Step 1.
2. ~~Are there lecture titles on the website?~~ Resolved: preserve original names + extract titles.
3. ~~Do you want the summaries in Romanian or English?~~ Resolved: Romanian.
4. Should the master study guide (SUPORT_CURS.md) include the full transcripts or just summaries?
5. Is there a 6th module coming? If so, the pipeline should be easily re-runnable.
6. Does whisper.cpp Windows binary support MP3 input natively? (Validated in Step 2 VRAM test)

## Success Criteria

- All ~35 MP3 files downloaded and organized by module
- All files transcribed to .txt and .srt with >90% accuracy
- Per-lecture summaries generated with key concepts extracted
- Master study guide (SUPORT_CURS.md) ready for reading/searching
- Pipeline is re-runnable for module 6 when it arrives

## Next Steps

1. **git init + .gitignore** — set up project, exclude audio/models/.env (~2 min)
2. **Browse cursuri.aresens.ro** — understand site structure before coding (~10 min)
3. **Build download.py** — login + scrape + download + manifest.json (~15 min with CC)
4. **Install whisper.cpp on Windows** — pre-built binary or build from source + Vulkan SDK (~15 min)
5. **Download whisper model** — large-v3-q5_0 from Hugging Face (~5 min)
6. **Test transcription** — 2-min clip, validate GPU, calibrate speed, check MP3 support (~5 min)
7. **Build transcribe.py** — reads manifest, processes in module order, updates status (~10 min with CC)
8. **Run batch transcription** — ~12-18 hours (leave running during workday)
9. **Spot-check quality** — review 2-3 transcripts after Module 1 completes
10. **Generate summaries with Claude Code** — via summarize.py helper (~30 min)
11. **Compile SUPORT_CURS.md** — master study guide (~10 min)

## NOT in scope
- Building a web UI or search interface for transcripts — just flat files
- Automated quality scoring of transcriptions — manual spot-check is sufficient
- Speaker diarization (identifying different speakers) — single lecturer
- Translation to English — summaries stay in Romanian
- CI/CD or deployment — this is a local personal pipeline

## What already exists
- Nothing — greenfield project. No existing code to reuse.
- The one existing file (`Master 25M1 Z1A [Audio].mp3`) confirms the naming pattern and audio specs.

## Failure Modes
```
FAILURE MODE                    | TEST? | HANDLING?      | SILENT?
================================|=======|================|========
Session expires during download | No    | Yes (retry)    | No — logged
MP3 truncated (network drop)    | Yes*  | Yes (size)     | No — validation
whisper.cpp OOM on large model  | No    | Yes (fallback) | No — logged
whisper.cpp can't read MP3      | No    | No**           | Yes — CRITICAL
Empty transcript output         | Yes*  | Yes (log)      | No — validation
Poor Romanian accuracy          | No    | Yes (gate)     | No — spot-check
Claude Code input too large     | No    | Yes (chunk)    | No — script handles
manifest.json corruption        | No    | No             | Yes — low risk

* = covered by inline validation checks
** = validated in Step 2 test; if fails, install ffmpeg or use pydub
```
**Critical gap:** whisper.cpp MP3 support must be validated in Step 2. If it fails silently (produces garbage), the entire batch is wasted.
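The other silent failure in the table is manifest.json corruption, e.g. a crash mid-write leaving a truncated file. A cheap mitigation, offered here as a suggestion rather than a committed part of the plan, is to write atomically via a temp file in the same directory:

```python
import json
import os
import tempfile
from pathlib import Path

def save_manifest_atomic(manifest: dict, path: Path) -> None:
    """Write to a temp file next to the target, then atomically replace it.
    A crash mid-write leaves the old manifest intact, never a truncated one."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2, ensure_ascii=False)
        os.replace(tmp, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

`os.replace` is atomic on both Windows and POSIX when source and destination share a filesystem, which holds here since the temp file lives in the manifest's own directory.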

## Eng Review Decisions (2026-03-24)
1. Hybrid platform → **All Windows Python** (not WSL2 for scripts)
2. Browse site first → build the right scraper, not two fallback paths
3. Preserve original file names + extract lecture titles
4. Add manifest.json as shared metadata between scripts
5. Python for all scripts (download.py, transcribe.py, summarize.py)
6. Built-in validation checks in each script
7. Feed MP3s directly (no pre-convert)
8. Process in module order
9. Realistic transcription estimate: 12-18 hours (not 7-8)

## What I noticed about how you think

- You said "vreau offline transcription + claude code pentru summaries" ("I want offline transcription + Claude Code for summaries") — you immediately found the pragmatic middle path between fully offline and fully API-dependent. That's good engineering instinct: use the best tool for each step rather than forcing one tool to do everything.
- You gave concrete numbers upfront: "5 module din 6, fiecare cu 7 audio-uri" ("5 of 6 modules, each with 7 audio files") and "90-100 minute" ("90-100 minutes") — you'd already scoped the problem before sitting down. That's not how most people start; most people say "I have some audio files."
- You chose "transcripts + summaries" over "just transcripts" or "full study system" — you know what's useful without over-engineering.

## GSTACK REVIEW REPORT

| Review | Trigger | Why | Runs | Status | Findings |
|--------|---------|-----|------|--------|----------|
| CEO Review | `/plan-ceo-review` | Scope & strategy | 0 | — | — |
| Codex Review | `/codex review` | Independent 2nd opinion | 0 | — | — |
| Eng Review | `/plan-eng-review` | Architecture & tests (required) | 1 | CLEAR (PLAN) | 8 issues, 0 critical gaps |
| Design Review | `/plan-design-review` | UI/UX gaps | 0 | — | — |

- **OUTSIDE VOICE:** Claude subagent ran — 10 findings, 3 cross-model tensions resolved (platform execution, speed estimate, module order)
- **UNRESOLVED:** 0
- **VERDICT:** ENG CLEARED — ready to implement
TODOS.md: 16 lines changed
@@ -1,8 +1,8 @@
# TODOS

## Re-run pipeline for Module 6
- **What:** Re-run `download.py` when module 6 becomes available on cursuri.aresens.ro/curs/26
- **Why:** Course has 6 modules total, only 5 are currently available. Pipeline is designed to be re-runnable — manifest.json + resumability means it discovers new modules and skips already-downloaded files.
- **How:** Run `python download.py` → check manifest for new files → run `python transcribe.py` → generate summaries → update SUPORT_CURS.md
- **Depends on:** Course provider publishing module 6
- **Added:** 2026-03-24
|
||||
# TODOS
|
||||
|
||||
## Re-run pipeline for Module 6
|
||||
- **What:** Re-run `download.py` when module 6 becomes available on cursuri.aresens.ro/curs/26
|
||||
- **Why:** Course has 6 modules total, only 5 are currently available. Pipeline is designed to be re-runnable — manifest.json + resumability means it discovers new modules and skips already-downloaded files.
|
||||
- **How:** Run `python download.py` → check manifest for new files → run `python transcribe.py` → generate summaries → update SUPORT_CURS.md
|
||||
- **Depends on:** Course provider publishing module 6
|
||||
- **Added:** 2026-03-24
|
||||
|
||||
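The "check manifest for new files" step in the TODO above can be sketched as a small helper that scans the manifest for lectures not yet downloaded. This is a sketch only; it assumes the `modules` → `lectures` → `download_status` shape that `download.py` writes, and the module/lecture names are illustrative.

```python
def pending_lectures(manifest: dict) -> list[tuple[str, str]]:
    """Return (module name, lecture title) pairs whose download is not complete."""
    return [
        (mod["name"], lec["title"])
        for mod in manifest["modules"]
        for lec in mod["lectures"]
        if lec["download_status"] != "complete"
    ]


# Example against a minimal manifest with one newly discovered module-6 lecture:
sample = {
    "modules": [
        {"name": "Modul 6", "lectures": [
            {"title": "Lectia 1", "download_status": "pending"},
            {"title": "Lectia 2", "download_status": "complete"},
        ]},
    ],
}
print(pending_lectures(sample))  # [('Modul 6', 'Lectia 1')]
```

Running this after a fresh `download.py` pass shows exactly which lectures still need a retry or are newly published.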
506
download.py
@@ -1,253 +1,253 @@
"""
Download all audio files from cursuri.aresens.ro NLP Master course.
Logs in, discovers modules and lectures, downloads MP3s, writes manifest.json.
Resumable: skips already-downloaded files.
"""

import json
import logging
import os
import sys
import time
from pathlib import Path
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv

BASE_URL = "https://cursuri.aresens.ro"
COURSE_URL = f"{BASE_URL}/curs/26"
LOGIN_URL = f"{BASE_URL}/login"
AUDIO_DIR = Path("audio")
MANIFEST_PATH = Path("manifest.json")
MAX_RETRIES = 3
RETRY_BACKOFF = [5, 15, 30]

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("download_errors.log"),
    ],
)
log = logging.getLogger(__name__)


def login(session: requests.Session, email: str, password: str) -> bool:
    """Login and return True on success."""
    resp = session.post(LOGIN_URL, data={
        "email": email,
        "password": password,
        "act": "login",
        "remember": "on",
    }, allow_redirects=True)
    # Successful login redirects to the course page, not back to /login
    if "/login" in resp.url or "loginform" in resp.text:
        return False
    return True


def discover_modules(session: requests.Session) -> list[dict]:
    """Fetch course page and return list of {name, url, module_id}."""
    resp = session.get(COURSE_URL)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    modules = []
    for div in soup.select("div.module"):
        number_el = div.select_one("div.module__number")
        link_el = div.select_one("a.btn")
        if not number_el or not link_el:
            continue
        href = link_el.get("href", "")
        module_id = href.rstrip("/").split("/")[-1]
        modules.append({
            "name": number_el.get_text(strip=True),
            "url": urljoin(BASE_URL, href),
            "module_id": module_id,
        })
    log.info(f"Found {len(modules)} modules")
    return modules


def discover_lectures(session: requests.Session, module: dict) -> list[dict]:
    """Fetch a module page and return list of lectures with audio URLs."""
    resp = session.get(module["url"])
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    lectures = []
    for lesson_div in soup.select("div.lesson"):
        name_el = lesson_div.select_one("div.module__name")
        source_el = lesson_div.select_one("audio source")
        if not name_el or not source_el:
            continue
        src = source_el.get("src", "").strip()
        if not src:
            continue
        audio_url = urljoin(BASE_URL, src)
        filename = src.split("/")[-1]
        title = name_el.get_text(strip=True)
        lectures.append({
            "title": title,
            "original_filename": filename,
            "url": audio_url,
            "audio_path": str(AUDIO_DIR / filename),
        })
    log.info(f"  {module['name']}: {len(lectures)} lectures")
    return lectures


def download_file(session: requests.Session, url: str, dest: Path) -> bool:
    """Download a file with retry logic. Returns True on success."""
    for attempt in range(MAX_RETRIES):
        try:
            resp = session.get(url, stream=True, timeout=300)
            resp.raise_for_status()

            # Write to temp file first, then rename (atomic)
            tmp = dest.with_suffix(".tmp")
            total = 0
            with open(tmp, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
                    total += len(chunk)

            if total < 1_000_000:  # < 1MB is suspicious
                log.warning(f"File too small ({total} bytes): {dest.name}")
                tmp.unlink(missing_ok=True)
                return False

            tmp.rename(dest)
            log.info(f"  Downloaded: {dest.name} ({total / 1_000_000:.1f} MB)")
            return True

        except Exception as e:
            wait = RETRY_BACKOFF[attempt] if attempt < len(RETRY_BACKOFF) else 30
            log.warning(f"  Attempt {attempt + 1}/{MAX_RETRIES} failed for {dest.name}: {e}")
            if attempt < MAX_RETRIES - 1:
                log.info(f"  Retrying in {wait}s...")
                time.sleep(wait)

    log.error(f"  FAILED after {MAX_RETRIES} attempts: {dest.name}")
    return False


def load_manifest() -> dict | None:
    """Load existing manifest if present."""
    if MANIFEST_PATH.exists():
        with open(MANIFEST_PATH) as f:
            return json.load(f)
    return None


def save_manifest(manifest: dict):
    """Write manifest.json."""
    with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, ensure_ascii=False)


def main():
    load_dotenv()
    email = os.getenv("COURSE_USERNAME", "")
    password = os.getenv("COURSE_PASSWORD", "")
    if not email or not password:
        log.error("Set COURSE_USERNAME and COURSE_PASSWORD in .env")
        sys.exit(1)

    AUDIO_DIR.mkdir(exist_ok=True)

    session = requests.Session()
    session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

    log.info("Logging in...")
    if not login(session, email, password):
        log.error("Login failed. Check credentials in .env")
        sys.exit(1)
    log.info("Login successful")

    # Discover structure
    modules = discover_modules(session)
    if not modules:
        log.error("No modules found")
        sys.exit(1)

    manifest = {
        "course": "NLP Master Practitioner Bucuresti 2025",
        "source_url": COURSE_URL,
        "modules": [],
    }

    total_files = 0
    downloaded = 0
    skipped = 0
    failed = 0

    for mod in modules:
        lectures = discover_lectures(session, mod)
        module_entry = {
            "name": mod["name"],
            "module_id": mod["module_id"],
            "lectures": [],
        }

        for lec in lectures:
            total_files += 1
            dest = Path(lec["audio_path"])
            stem = dest.stem.replace(" [Audio]", "")

            lecture_entry = {
                "title": lec["title"],
                "original_filename": lec["original_filename"],
                "url": lec["url"],
                "audio_path": lec["audio_path"],
                "transcript_path": f"transcripts/{stem}.txt",
                "srt_path": f"transcripts/{stem}.srt",
                "summary_path": f"summaries/{stem}_summary.md",
                "download_status": "pending",
                "transcribe_status": "pending",
                "file_size_bytes": 0,
            }

            # Skip if already downloaded
            if dest.exists() and dest.stat().st_size > 1_000_000:
                lecture_entry["download_status"] = "complete"
                lecture_entry["file_size_bytes"] = dest.stat().st_size
                skipped += 1
                log.info(f"  Skipping (exists): {dest.name}")
            else:
                if download_file(session, lec["url"], dest):
                    lecture_entry["download_status"] = "complete"
                    lecture_entry["file_size_bytes"] = dest.stat().st_size
                    downloaded += 1
                else:
                    lecture_entry["download_status"] = "failed"
                    failed += 1

            module_entry["lectures"].append(lecture_entry)

        manifest["modules"].append(module_entry)
        # Save manifest after each module (checkpoint)
        save_manifest(manifest)

    # Final validation
    all_ok = all(
        Path(lec["audio_path"]).exists() and Path(lec["audio_path"]).stat().st_size > 1_000_000
        for mod in manifest["modules"]
        for lec in mod["lectures"]
        if lec["download_status"] == "complete"
    )

    log.info("=" * 60)
    log.info(f"Downloaded {downloaded}/{total_files} files, {skipped} skipped, {failed} failures.")
    log.info(f"All files > 1MB: {'PASS' if all_ok else 'FAIL'}")
    log.info("=" * 60)

    if failed:
        sys.exit(1)


if __name__ == "__main__":
    main()
1064
manifest.json
File diff suppressed because it is too large
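The manifest.json diff is too large to render, but its per-lecture shape follows directly from the `lecture_entry` dict built in download.py. A minimal sketch of one entry (titles, paths, and sizes are illustrative, not taken from the real manifest):

```python
# One lecture entry as written by download.py; the " [Audio]" suffix is
# stripped from the stem before deriving transcript/summary paths.
lecture_entry = {
    "title": "Lectia 1",
    "original_filename": "lectia-1 [Audio].mp3",
    "url": "https://cursuri.aresens.ro/media/lectia-1.mp3",  # illustrative URL
    "audio_path": "audio/lectia-1 [Audio].mp3",
    "transcript_path": "transcripts/lectia-1.txt",
    "srt_path": "transcripts/lectia-1.srt",
    "summary_path": "summaries/lectia-1_summary.md",
    "download_status": "complete",   # pending | complete | failed
    "transcribe_status": "pending",  # presumably updated by transcribe.py
    "file_size_bytes": 91234567,
}
```

Knowing this shape is what makes the pipeline resumable: any tool can filter on `download_status` / `transcribe_status` without re-crawling the site.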
@@ -1,3 +1,3 @@
requests
beautifulsoup4
python-dotenv
626
run.bat
626
run.bat
@@ -1,313 +1,313 @@
|
||||
@echo off
|
||||
setlocal enabledelayedexpansion
|
||||
cd /d "%~dp0"
|
||||
|
||||
:: Prevent Vulkan from exhausting VRAM — overflow to system RAM instead of crashing
|
||||
set "GGML_VK_PREFER_HOST_MEMORY=ON"
|
||||
|
||||
echo ============================================================
|
||||
echo NLP Master - Download + Transcribe Pipeline
|
||||
echo ============================================================
|
||||
echo.
|
||||
|
||||
:: ============================================================
|
||||
:: PREREQUISITES CHECK
|
||||
:: ============================================================
|
||||
echo Checking prerequisites...
|
||||
echo.
|
||||
set "PREREQ_OK=1"
|
||||
set "NEED_WHISPER="
|
||||
set "NEED_MODEL="
|
||||
|
||||
:: --- Python ---
|
||||
python --version >nul 2>&1
|
||||
if errorlevel 1 (
|
||||
echo [X] Python NOT FOUND
|
||||
echo Install from: https://www.python.org/downloads/
|
||||
echo Make sure to check "Add Python to PATH" during install.
|
||||
echo.
|
||||
echo Cannot continue without Python. Install it and re-run.
|
||||
pause
|
||||
exit /b 1
|
||||
) else (
|
||||
for /f "tokens=2" %%v in ('python --version 2^>^&1') do echo [OK] Python %%v
|
||||
)
|
||||
|
||||
:: --- .env credentials ---
|
||||
if exist ".env" (
|
||||
findstr /m "COURSE_USERNAME=." ".env" >nul 2>&1
|
||||
if errorlevel 1 (
|
||||
echo [X] .env File exists but COURSE_USERNAME is empty
|
||||
echo Edit .env and fill in your credentials.
|
||||
set "PREREQ_OK="
|
||||
) else (
|
||||
echo [OK] .env Credentials configured
|
||||
)
|
||||
) else (
|
||||
echo [X] .env NOT FOUND
|
||||
echo Create .env with:
|
||||
echo COURSE_USERNAME=your_email
|
||||
echo COURSE_PASSWORD=your_password
|
||||
set "PREREQ_OK="
|
||||
)
|
||||
|
||||
:: --- ffmpeg ---
|
||||
set "FFMPEG_FOUND="
|
||||
set "NEED_FFMPEG="
|
||||
where ffmpeg >nul 2>&1
|
||||
if not errorlevel 1 (
|
||||
set "FFMPEG_FOUND=1"
|
||||
for /f "delims=" %%p in ('where ffmpeg 2^>nul') do set "FFMPEG_LOCATION=%%p"
|
||||
echo [OK] ffmpeg !FFMPEG_LOCATION!
|
||||
) else (
|
||||
if exist "ffmpeg.exe" (
|
||||
set "FFMPEG_FOUND=1"
|
||||
echo [OK] ffmpeg .\ffmpeg.exe (local^)
|
||||
) else (
|
||||
echo [--] ffmpeg Not found - will auto-install
|
||||
set "NEED_FFMPEG=1"
|
||||
)
|
||||
)
|
||||
|
||||
:: --- whisper-cli.exe ---
|
||||
set "WHISPER_FOUND="
|
||||
set "WHISPER_LOCATION="
|
||||
if defined WHISPER_BIN (
|
||||
if exist "%WHISPER_BIN%" (
|
||||
set "WHISPER_FOUND=1"
|
||||
set "WHISPER_LOCATION=%WHISPER_BIN% (env var)"
|
||||
)
|
||||
)
|
||||
if not defined WHISPER_FOUND (
|
||||
where whisper-cli.exe >nul 2>&1
|
||||
if not errorlevel 1 (
|
||||
set "WHISPER_FOUND=1"
|
||||
for /f "delims=" %%p in ('where whisper-cli.exe 2^>nul') do set "WHISPER_LOCATION=%%p (PATH)"
|
||||
)
|
||||
)
|
||||
if not defined WHISPER_FOUND (
|
||||
if exist "whisper-cli.exe" (
|
||||
set "WHISPER_FOUND=1"
|
||||
set "WHISPER_BIN=whisper-cli.exe"
|
||||
set "WHISPER_LOCATION=.\whisper-cli.exe (local)"
|
||||
)
|
||||
)
|
||||
if not defined WHISPER_FOUND (
|
||||
if exist "whisper-bin\whisper-cli.exe" (
|
||||
set "WHISPER_FOUND=1"
|
||||
set "WHISPER_BIN=whisper-bin\whisper-cli.exe"
|
||||
set "WHISPER_LOCATION=whisper-bin\whisper-cli.exe (auto-installed)"
|
||||
)
|
||||
)
|
||||
if not defined WHISPER_FOUND (
|
||||
if exist "whisper.cpp\build\bin\Release\whisper-cli.exe" (
|
||||
set "WHISPER_FOUND=1"
|
||||
set "WHISPER_BIN=whisper.cpp\build\bin\Release\whisper-cli.exe"
|
||||
set "WHISPER_LOCATION=whisper.cpp\build\... (local build)"
|
||||
)
|
||||
)
|
||||
|
||||
if defined WHISPER_FOUND (
|
||||
echo [OK] whisper-cli !WHISPER_LOCATION!
|
||||
) else (
|
||||
echo [--] whisper-cli Not found - will auto-download
|
||||
set "NEED_WHISPER=1"
|
||||
)
|
||||
|
||||
:: --- Whisper model ---
|
||||
if not defined WHISPER_MODEL set "WHISPER_MODEL=models\ggml-medium-q5_0.bin"
|
||||
if exist "%WHISPER_MODEL%" (
|
||||
for %%F in ("%WHISPER_MODEL%") do (
|
||||
set /a "MODEL_MB=%%~zF / 1048576"
|
||||
)
|
||||
echo [OK] Whisper model %WHISPER_MODEL% (!MODEL_MB! MB^)
|
||||
) else (
|
||||
echo [--] Whisper model Not found - will auto-download (~500 MB^)
|
||||
set "NEED_MODEL=1"
|
||||
)
|
||||
|
||||
:: --- Vulkan GPU support ---
|
||||
set "VULKAN_FOUND="
|
||||
where vulkaninfo >nul 2>&1
|
||||
if not errorlevel 1 (
|
||||
set "VULKAN_FOUND=1"
|
||||
echo [OK] Vulkan SDK Installed
|
||||
) else (
|
||||
if exist "%VULKAN_SDK%\Bin\vulkaninfo.exe" (
|
||||
set "VULKAN_FOUND=1"
|
||||
echo [OK] Vulkan SDK %VULKAN_SDK%
|
||||
) else (
|
||||
echo [!!] Vulkan SDK Not detected (whisper.cpp may use CPU fallback^)
|
||||
echo Install from: https://vulkan.lunarg.com/sdk/home
|
||||
)
|
||||
)
|
||||
|
||||
:: --- Disk space ---
|
||||
echo.
|
||||
for /f "tokens=3" %%a in ('dir /-c "%~dp0." 2^>nul ^| findstr /c:"bytes free"') do (
|
||||
set /a "FREE_GB=%%a / 1073741824" 2>nul
|
||||
)
|
||||
if defined FREE_GB (
|
||||
if !FREE_GB! LSS 50 (
|
||||
echo [!!] Disk space ~!FREE_GB! GB free (need ~50 GB for all audio + transcripts^)
|
||||
) else (
|
||||
echo [OK] Disk space ~!FREE_GB! GB free
|
||||
)
|
||||
)
|
||||
|
||||
echo.
|
||||
|
||||
:: --- Stop if .env is broken (can't auto-fix that) ---
|
||||
if not defined PREREQ_OK (
|
||||
echo ============================================================
|
||||
echo MISSING PREREQUISITES - fix the [X] items above and re-run.
|
||||
echo ============================================================
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
|
||||
:: ============================================================
|
||||
:: AUTO-INSTALL MISSING COMPONENTS
|
||||
:: ============================================================
|
||||
if defined NEED_FFMPEG (
|
||||
echo ============================================================
|
||||
echo Auto-downloading ffmpeg...
|
||||
echo ============================================================
|
||||
python setup_whisper.py ffmpeg
|
||||
if errorlevel 1 (
|
||||
echo.
|
||||
echo ERROR: Could not install ffmpeg.
|
||||
echo Download manually from: https://www.gyan.dev/ffmpeg/builds/
|
||||
echo Extract ffmpeg.exe to ffmpeg-bin\ and re-run.
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
if exist ".ffmpeg_bin_path" del .ffmpeg_bin_path
|
||||
echo.
|
||||
)
|
||||
|
||||
:: Add ffmpeg-bin to PATH if it exists
|
||||
if exist "ffmpeg-bin\ffmpeg.exe" (
|
||||
set "PATH=%~dp0ffmpeg-bin;%PATH%"
|
||||
)
|
||||
|
||||
if defined NEED_WHISPER (
|
||||
echo ============================================================
|
||||
echo Auto-downloading whisper.cpp (Vulkan build^)...
|
||||
echo ============================================================
|
||||
python setup_whisper.py whisper
|
||||
if errorlevel 1 (
|
||||
echo.
|
||||
echo ERROR: Failed to auto-download whisper.cpp.
|
||||
echo Download manually from: https://github.com/ggml-org/whisper.cpp/releases
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
:: Read the path that setup_whisper.py wrote
|
||||
if exist ".whisper_bin_path" (
|
||||
set /p WHISPER_BIN=<.whisper_bin_path
|
||||
del .whisper_bin_path
|
||||
echo Using: !WHISPER_BIN!
|
||||
)
|
||||
echo.
|
||||
)
|
||||
|
||||
if defined NEED_MODEL (
|
||||
echo ============================================================
|
||||
echo Auto-downloading Whisper model (ggml-medium-q5_0, ~500 MB^)...
|
||||
echo This will take a few minutes depending on your connection.
|
||||
echo ============================================================
|
||||
python setup_whisper.py model
|
||||
if errorlevel 1 (
|
||||
echo.
|
||||
echo ERROR: Failed to download model.
|
||||
echo Download manually from:
|
||||
echo https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin
|
||||
echo Save to: models\ggml-medium-q5_0.bin
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
echo.
|
||||
)
|
||||
|
||||
echo All prerequisites OK!
|
||||
echo.
|
||||
echo ============================================================
|
||||
echo Starting pipeline...
|
||||
echo ============================================================
|
||||
echo.
|
||||
|
||||
:: ============================================================
|
||||
:: STEP 1: VENV + DEPENDENCIES
|
||||
:: ============================================================
|
||||
if not exist ".venv\Scripts\python.exe" (
|
||||
echo [1/4] Creating Python virtual environment...
|
||||
python -m venv .venv
|
||||
if errorlevel 1 (
|
||||
echo ERROR: Failed to create venv.
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
echo Done.
|
||||
) else (
|
||||
echo [1/4] Virtual environment already exists.
|
||||
)
|
||||
|
||||
echo [2/4] Installing Python dependencies...
|
||||
.venv\Scripts\pip install -q -r requirements.txt
|
||||
if errorlevel 1 (
|
||||
echo ERROR: Failed to install dependencies.
|
||||
pause
|
||||
exit /b 1
|
||||
)
|
||||
echo Done.
|
||||
|
||||
:: ============================================================
|
||||
:: STEP 2: DOWNLOAD
|
||||
:: ============================================================
|
||||
echo.
|
||||
echo [3/4] Downloading audio files...
|
||||
echo ============================================================
|
||||
.venv\Scripts\python download.py
|
||||
if errorlevel 1 (
|
||||
echo.
|
||||
echo WARNING: Some downloads failed. Check download_errors.log
|
||||
echo Press any key to continue to transcription anyway, or Ctrl+C to abort.
|
||||
pause >nul
|
||||
)
|
||||
|
||||
:: ============================================================
|
||||
:: STEP 3: TRANSCRIBE
|
||||
:: ============================================================
|
||||
echo.
|
||||
echo [4/4] Transcribing with whisper.cpp...
|
||||
echo ============================================================
|
||||
echo Using: %WHISPER_BIN%
|
||||
echo Model: %WHISPER_MODEL%
|
||||
echo.
|
||||
|
||||
if "%~1"=="" (
|
||||
.venv\Scripts\python transcribe.py
|
||||
) else (
|
||||
echo Modules filter: %~1
|
||||
.venv\Scripts\python transcribe.py --modules %~1
|
||||
)
|
||||
if errorlevel 1 (
|
||||
echo.
|
||||
echo WARNING: Some transcriptions failed. Check transcribe_errors.log
|
||||
)
|
||||
|
||||
:: ============================================================
|
||||
:: DONE
|
||||
:: ============================================================
|
||||
echo.
|
||||
echo ============================================================
|
||||
echo Pipeline complete!
|
||||
echo - Audio files: audio\
|
||||
echo - Transcripts: transcripts\
|
||||
echo - Manifest: manifest.json
|
||||
echo.
|
||||
echo Next step: generate summaries from WSL2 with Claude Code
|
||||
echo python summarize.py
|
||||
echo ============================================================
|
||||
pause
|
||||
@echo off
|
||||
setlocal enabledelayedexpansion
|
||||
cd /d "%~dp0"
|
||||
|
||||
:: Prevent Vulkan from exhausting VRAM — overflow to system RAM instead of crashing
|
||||
set "GGML_VK_PREFER_HOST_MEMORY=ON"
|
||||
|
||||
echo ============================================================
|
||||
echo NLP Master - Download + Transcribe Pipeline
|
||||
echo ============================================================
|
||||
echo.
|
||||
|
||||
:: ============================================================
|
||||
:: PREREQUISITES CHECK
|
||||
:: ============================================================
|
||||
echo Checking prerequisites...
|
||||
echo.
|
||||
set "PREREQ_OK=1"
|
||||
set "NEED_WHISPER="
|
||||
set "NEED_MODEL="
|
||||
|
||||
:: --- Python ---
|
||||
python --version >nul 2>&1
|
||||
if errorlevel 1 (
|
||||
echo [X] Python NOT FOUND
|
||||
echo Install from: https://www.python.org/downloads/
|
||||
echo Make sure to check "Add Python to PATH" during install.
|
||||
echo.
|
||||
echo Cannot continue without Python. Install it and re-run.
|
||||
pause
|
||||
exit /b 1
|
||||
) else (
|
||||
for /f "tokens=2" %%v in ('python --version 2^>^&1') do echo [OK] Python %%v
|
||||
)
|
||||
|
||||
:: --- .env credentials ---
|
||||
if exist ".env" (
|
||||
findstr /m "COURSE_USERNAME=." ".env" >nul 2>&1
|
||||
if errorlevel 1 (
|
||||
echo [X] .env File exists but COURSE_USERNAME is empty
|
||||
echo Edit .env and fill in your credentials.
|
||||
set "PREREQ_OK="
|
||||
) else (
|
||||
echo [OK] .env Credentials configured
|
||||
)
|
||||
) else (
|
||||
echo [X] .env NOT FOUND
|
||||
echo Create .env with:
|
||||
echo COURSE_USERNAME=your_email
|
||||
echo COURSE_PASSWORD=your_password
|
||||
set "PREREQ_OK="
)

:: --- ffmpeg ---
set "FFMPEG_FOUND="
set "NEED_FFMPEG="
where ffmpeg >nul 2>&1
if not errorlevel 1 (
    set "FFMPEG_FOUND=1"
    for /f "delims=" %%p in ('where ffmpeg 2^>nul') do set "FFMPEG_LOCATION=%%p"
    echo [OK] ffmpeg !FFMPEG_LOCATION!
) else (
    if exist "ffmpeg.exe" (
        set "FFMPEG_FOUND=1"
        echo [OK] ffmpeg .\ffmpeg.exe (local^)
    ) else (
        echo [--] ffmpeg Not found - will auto-install
        set "NEED_FFMPEG=1"
    )
)

:: --- whisper-cli.exe ---
set "WHISPER_FOUND="
set "WHISPER_LOCATION="
if defined WHISPER_BIN (
    if exist "%WHISPER_BIN%" (
        set "WHISPER_FOUND=1"
        set "WHISPER_LOCATION=%WHISPER_BIN% (env var)"
    )
)
if not defined WHISPER_FOUND (
    where whisper-cli.exe >nul 2>&1
    if not errorlevel 1 (
        set "WHISPER_FOUND=1"
        for /f "delims=" %%p in ('where whisper-cli.exe 2^>nul') do set "WHISPER_LOCATION=%%p (PATH)"
    )
)
if not defined WHISPER_FOUND (
    if exist "whisper-cli.exe" (
        set "WHISPER_FOUND=1"
        set "WHISPER_BIN=whisper-cli.exe"
        set "WHISPER_LOCATION=.\whisper-cli.exe (local)"
    )
)
if not defined WHISPER_FOUND (
    if exist "whisper-bin\whisper-cli.exe" (
        set "WHISPER_FOUND=1"
        set "WHISPER_BIN=whisper-bin\whisper-cli.exe"
        set "WHISPER_LOCATION=whisper-bin\whisper-cli.exe (auto-installed)"
    )
)
if not defined WHISPER_FOUND (
    if exist "whisper.cpp\build\bin\Release\whisper-cli.exe" (
        set "WHISPER_FOUND=1"
        set "WHISPER_BIN=whisper.cpp\build\bin\Release\whisper-cli.exe"
        set "WHISPER_LOCATION=whisper.cpp\build\... (local build)"
    )
)

if defined WHISPER_FOUND (
    echo [OK] whisper-cli !WHISPER_LOCATION!
) else (
    echo [--] whisper-cli Not found - will auto-download
    set "NEED_WHISPER=1"
)
:: --- Whisper model ---
if not defined WHISPER_MODEL set "WHISPER_MODEL=models\ggml-medium-q5_0.bin"
if exist "%WHISPER_MODEL%" (
    for %%F in ("%WHISPER_MODEL%") do (
        set /a "MODEL_MB=%%~zF / 1048576"
    )
    echo [OK] Whisper model %WHISPER_MODEL% (!MODEL_MB! MB^)
) else (
    echo [--] Whisper model Not found - will auto-download (~500 MB^)
    set "NEED_MODEL=1"
)

:: --- Vulkan GPU support ---
set "VULKAN_FOUND="
where vulkaninfo >nul 2>&1
if not errorlevel 1 (
    set "VULKAN_FOUND=1"
    echo [OK] Vulkan SDK Installed
) else (
    if exist "%VULKAN_SDK%\Bin\vulkaninfo.exe" (
        set "VULKAN_FOUND=1"
        echo [OK] Vulkan SDK %VULKAN_SDK%
    ) else (
        echo [!!] Vulkan SDK Not detected (whisper.cpp may use CPU fallback^)
        echo Install from: https://vulkan.lunarg.com/sdk/home
    )
)
:: --- Disk space ---
echo.
for /f "tokens=3" %%a in ('dir /-c "%~dp0." 2^>nul ^| findstr /c:"bytes free"') do (
    set /a "FREE_GB=%%a / 1073741824" 2>nul
)
if defined FREE_GB (
    if !FREE_GB! LSS 50 (
        echo [!!] Disk space ~!FREE_GB! GB free (need ~50 GB for all audio + transcripts^)
    ) else (
        echo [OK] Disk space ~!FREE_GB! GB free
    )
)
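Parsing `dir | findstr /c:"bytes free"` only works on English Windows locales, where that exact phrase appears in the `dir` footer. A locale-independent sketch of the same check in Python (`free_gb` is a hypothetical helper, not part of the pipeline):

```python
import shutil

def free_gb(path: str = ".") -> int:
    """Whole gigabytes free on the drive containing `path`."""
    return shutil.disk_usage(path).free // 1_073_741_824

gb = free_gb()
print(f"[{'OK' if gb >= 50 else '!!'}] Disk space ~{gb} GB free")
```

`shutil.disk_usage` has been in the standard library since Python 3.3, so this could run from the same venv the pipeline already creates.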
echo.

:: --- Stop if .env is broken (can't auto-fix that) ---
if not defined PREREQ_OK (
    echo ============================================================
    echo MISSING PREREQUISITES - fix the [X] items above and re-run.
    echo ============================================================
    pause
    exit /b 1
)
:: ============================================================
:: AUTO-INSTALL MISSING COMPONENTS
:: ============================================================
if defined NEED_FFMPEG (
    echo ============================================================
    echo Auto-downloading ffmpeg...
    echo ============================================================
    python setup_whisper.py ffmpeg
    if errorlevel 1 (
        echo.
        echo ERROR: Could not install ffmpeg.
        echo Download manually from: https://www.gyan.dev/ffmpeg/builds/
        echo Extract ffmpeg.exe to ffmpeg-bin\ and re-run.
        pause
        exit /b 1
    )
    if exist ".ffmpeg_bin_path" del .ffmpeg_bin_path
    echo.
)

:: Add ffmpeg-bin to PATH if it exists
if exist "ffmpeg-bin\ffmpeg.exe" (
    set "PATH=%~dp0ffmpeg-bin;%PATH%"
)
if defined NEED_WHISPER (
    echo ============================================================
    echo Auto-downloading whisper.cpp (Vulkan build^)...
    echo ============================================================
    python setup_whisper.py whisper
    if errorlevel 1 (
        echo.
        echo ERROR: Failed to auto-download whisper.cpp.
        echo Download manually from: https://github.com/ggml-org/whisper.cpp/releases
        pause
        exit /b 1
    )
    rem Read the path that setup_whisper.py wrote (:: comments are unsafe inside parenthesized blocks)
    if exist ".whisper_bin_path" (
        set /p WHISPER_BIN=<.whisper_bin_path
        del .whisper_bin_path
        echo Using: !WHISPER_BIN!
    )
    echo.
)
if defined NEED_MODEL (
    echo ============================================================
    echo Auto-downloading Whisper model (ggml-medium-q5_0, ~500 MB^)...
    echo This will take a few minutes depending on your connection.
    echo ============================================================
    python setup_whisper.py model
    if errorlevel 1 (
        echo.
        echo ERROR: Failed to download model.
        echo Download manually from:
        echo https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-medium-q5_0.bin
        echo Save to: models\ggml-medium-q5_0.bin
        pause
        exit /b 1
    )
    echo.
)
echo All prerequisites OK!
echo.
echo ============================================================
echo Starting pipeline...
echo ============================================================
echo.

:: ============================================================
:: STEP 1: VENV + DEPENDENCIES
:: ============================================================
if not exist ".venv\Scripts\python.exe" (
    echo [1/4] Creating Python virtual environment...
    python -m venv .venv
    if errorlevel 1 (
        echo ERROR: Failed to create venv.
        pause
        exit /b 1
    )
    echo Done.
) else (
    echo [1/4] Virtual environment already exists.
)

echo [2/4] Installing Python dependencies...
.venv\Scripts\pip install -q -r requirements.txt
if errorlevel 1 (
    echo ERROR: Failed to install dependencies.
    pause
    exit /b 1
)
echo Done.
:: ============================================================
:: STEP 2: DOWNLOAD
:: ============================================================
echo.
echo [3/4] Downloading audio files...
echo ============================================================
.venv\Scripts\python download.py
if errorlevel 1 (
    echo.
    echo WARNING: Some downloads failed. Check download_errors.log
    echo Press any key to continue to transcription anyway, or Ctrl+C to abort.
    pause >nul
)
:: ============================================================
:: STEP 3: TRANSCRIBE
:: ============================================================
echo.
echo [4/4] Transcribing with whisper.cpp...
echo ============================================================
echo Using: %WHISPER_BIN%
echo Model: %WHISPER_MODEL%
echo.

if "%~1"=="" (
    .venv\Scripts\python transcribe.py
) else (
    echo Modules filter: %~1
    .venv\Scripts\python transcribe.py --modules %~1
)
if errorlevel 1 (
    echo.
    echo WARNING: Some transcriptions failed. Check transcribe_errors.log
)
:: ============================================================
:: DONE
:: ============================================================
echo.
echo ============================================================
echo Pipeline complete!
echo - Audio files: audio\
echo - Transcripts: transcripts\
echo - Manifest: manifest.json
echo.
echo Next step: generate summaries from WSL2 with Claude Code
echo python summarize.py
echo ============================================================
pause
setup_whisper.py
@@ -1,325 +1,325 @@
"""
Auto-download and setup whisper.cpp (Vulkan) + model for Windows.
Called by run.bat when prerequisites are missing.
"""

import io
import json
import os
import sys
import zipfile
from pathlib import Path
from urllib.request import urlopen, Request

MODELS_DIR = Path("models")
MODEL_NAME = "ggml-medium-q5_0.bin"
MODEL_URL = f"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/{MODEL_NAME}"

GITHUB_API = "https://api.github.com/repos/ggml-org/whisper.cpp/releases/latest"
# Community Vulkan builds (for AMD GPUs)
VULKAN_BUILDS_API = "https://api.github.com/repos/jerryshell/whisper.cpp-windows-vulkan-bin/releases/latest"
WHISPER_DIR = Path("whisper-bin")

def progress_bar(current: int, total: int, width: int = 40):
    if total <= 0:
        return
    pct = current / total
    filled = int(width * pct)
    bar = "=" * filled + "-" * (width - filled)
    mb_done = current / 1_048_576
    mb_total = total / 1_048_576
    print(f"\r [{bar}] {pct:.0%} {mb_done:.0f}/{mb_total:.0f} MB", end="", flush=True)

def download_file(url: str, dest: Path, desc: str):
    """Download a file with progress bar."""
    print(f"\n Downloading {desc}...")
    print(f" URL: {url}")

    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    resp = urlopen(req, timeout=60)

    total = int(resp.headers.get("Content-Length", 0))
    downloaded = 0
    tmp = dest.with_suffix(".tmp")

    with open(tmp, "wb") as f:
        while True:
            chunk = resp.read(1024 * 1024)
            if not chunk:
                break
            f.write(chunk)
            downloaded += len(chunk)
            progress_bar(downloaded, total)

    print()  # newline after progress bar
    tmp.rename(dest)
    print(f" Saved: {dest} ({downloaded / 1_048_576:.0f} MB)")

def fetch_release(api_url: str) -> dict | None:
    """Fetch a GitHub release JSON."""
    req = Request(api_url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        resp = urlopen(req, timeout=30)
        return json.loads(resp.read())
    except Exception as e:
        print(f" Could not fetch from {api_url}: {e}")
        return None

def extract_zip(zip_path: Path):
    """Extract zip contents into WHISPER_DIR, flattened."""
    print(f"\n Extracting to {WHISPER_DIR}/...")
    WHISPER_DIR.mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            filename = Path(member).name
            if not filename:
                continue
            target = WHISPER_DIR / filename
            with zf.open(member) as src, open(target, "wb") as dst:
                dst.write(src.read())
            print(f"   {filename}")
    zip_path.unlink()

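The flattening behavior above — every archive entry lands directly in the target directory under `Path(member).name`, with directory entries skipped — can be exercised offline with an in-memory zip. This standalone sketch mirrors that logic; the entry names are hypothetical:

```python
import io
import zipfile
from pathlib import Path
from tempfile import TemporaryDirectory

def extract_flat(zip_bytes: bytes, dest: Path) -> list[str]:
    """Extract a zip into dest, discarding any internal directory structure."""
    dest.mkdir(exist_ok=True)
    names = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for member in zf.namelist():
            filename = Path(member).name
            if not filename:          # directory entry, e.g. "bin/"
                continue
            (dest / filename).write_bytes(zf.read(member))
            names.append(filename)
    return names

# Build a nested zip in memory, then flatten it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Release/bin/whisper-cli.exe", b"exe")
    zf.writestr("Release/ggml-vulkan.dll", b"dll")
with TemporaryDirectory() as td:
    extracted = sorted(extract_flat(buf.getvalue(), Path(td)))
print(extracted)  # nested paths reduced to bare file names
```

One consequence worth noting: two entries with the same base name in different folders would overwrite each other after flattening.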
def find_whisper_exe() -> str | None:
    """Find whisper-cli.exe (or similar) in WHISPER_DIR."""
    whisper_exe = WHISPER_DIR / "whisper-cli.exe"
    if whisper_exe.exists():
        return str(whisper_exe)

    # Try main.exe (older naming)
    main_exe = WHISPER_DIR / "main.exe"
    if main_exe.exists():
        return str(main_exe)

    exes = list(WHISPER_DIR.glob("*.exe"))
    for exe in exes:
        if "whisper" in exe.name.lower() and "cli" in exe.name.lower():
            return str(exe)
    for exe in exes:
        if "whisper" in exe.name.lower():
            return str(exe)
    if exes:
        return str(exes[0])
    return None

def try_community_vulkan_build() -> str | None:
    """Try downloading Vulkan build from jerryshell's community repo."""
    print("\n Trying community Vulkan build (jerryshell/whisper.cpp-windows-vulkan-bin)...")
    release = fetch_release(VULKAN_BUILDS_API)
    if not release:
        return None

    tag = release.get("tag_name", "unknown")
    print(f" Community release: {tag}")

    # Find a zip asset
    for asset in release.get("assets", []):
        name = asset["name"].lower()
        if name.endswith(".zip"):
            print(f" Found: {asset['name']}")
            zip_path = Path(asset["name"])
            download_file(asset["browser_download_url"], zip_path, asset["name"])
            extract_zip(zip_path)
            return find_whisper_exe()

    print(" No zip asset found in community release")
    return None

def try_official_vulkan_build() -> str | None:
    """Try downloading Vulkan build from official ggml-org releases."""
    print("\n Fetching latest whisper.cpp release from ggml-org...")
    release = fetch_release(GITHUB_API)
    if not release:
        return None

    tag = release.get("tag_name", "unknown")
    print(f" Official release: {tag}")

    # Priority: vulkan > noavx (cpu-only, no CUDA deps) > skip CUDA entirely
    vulkan_asset = None
    cpu_asset = None
    for asset in release.get("assets", []):
        name = asset["name"].lower()
        if not name.endswith(".zip"):
            continue
        # Must be Windows
        if "win" not in name and "x64" not in name:
            continue
        # Absolutely skip CUDA builds - they won't work on AMD
        if "cuda" in name:
            continue
        if "vulkan" in name:
            vulkan_asset = asset
            break
        if "noavx" not in name and "openblas" not in name:
            cpu_asset = asset

    chosen = vulkan_asset or cpu_asset
    if not chosen:
        print(" No Vulkan or CPU-only build found in official releases")
        print(" Available assets:")
        for asset in release.get("assets", []):
            print(f"   - {asset['name']}")
        return None

    if vulkan_asset:
        print(f" Found official Vulkan build: {chosen['name']}")
    else:
        print(f" No Vulkan build in official release, using CPU build: {chosen['name']}")
        print(" (Will work but without GPU acceleration)")

    zip_path = Path(chosen["name"])
    download_file(chosen["browser_download_url"], zip_path, chosen["name"])
    extract_zip(zip_path)
    return find_whisper_exe()

def setup_whisper_bin() -> str | None:
    """Download whisper.cpp Vulkan release. Returns path to whisper-cli.exe."""
    whisper_exe = WHISPER_DIR / "whisper-cli.exe"
    if whisper_exe.exists():
        # Check if it's a CUDA build (has CUDA DLLs but no Vulkan DLL)
        has_cuda = (WHISPER_DIR / "ggml-cuda.dll").exists()
        has_vulkan = (WHISPER_DIR / "ggml-vulkan.dll").exists()
        if has_cuda and not has_vulkan:
            print(" WARNING: Existing install is a CUDA build (won't work on AMD GPU)")
            print(" Removing and re-downloading Vulkan build...")
            import shutil
            shutil.rmtree(WHISPER_DIR)
        else:
            print(f" whisper-cli.exe already exists at {whisper_exe}")
            return str(whisper_exe)

    # Strategy: try community Vulkan build first (reliable for AMD),
    # then fall back to official release
    exe_path = try_community_vulkan_build()
    if exe_path:
        print(f"\n whisper-cli.exe ready at: {exe_path} (Vulkan)")
        return exe_path

    print("\n Community build failed, trying official release...")
    exe_path = try_official_vulkan_build()
    if exe_path:
        print(f"\n whisper-cli.exe ready at: {exe_path}")
        return exe_path

    print("\n ERROR: Could not download whisper.cpp")
    print(" Manual install: https://github.com/ggml-org/whisper.cpp/releases")
    print(" Build from source with: cmake -DGGML_VULKAN=1")
    return None

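The CUDA-vs-Vulkan sniffing above keys off which ggml backend DLLs sit next to the binary. A hypothetical `build_kind` helper (not part of setup_whisper.py) makes the rule explicit and testable:

```python
def build_kind(dll_names: set[str]) -> str:
    """Classify an extracted whisper.cpp build by its ggml backend DLLs."""
    has_cuda = "ggml-cuda.dll" in dll_names
    has_vulkan = "ggml-vulkan.dll" in dll_names
    if has_cuda and not has_vulkan:
        return "cuda"      # unusable on AMD; setup_whisper_bin re-downloads
    if has_vulkan:
        return "vulkan"
    return "cpu"

print(build_kind({"ggml-vulkan.dll", "ggml.dll"}))
```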
FFMPEG_DIR = Path("ffmpeg-bin")
FFMPEG_URL = "https://www.gyan.dev/ffmpeg/builds/ffmpeg-release-essentials.zip"


def setup_ffmpeg() -> str | None:
    """Download ffmpeg if not found. Returns path to ffmpeg.exe."""
    import shutil

    # Already in PATH?
    if shutil.which("ffmpeg"):
        path = shutil.which("ffmpeg")
        print(f" ffmpeg already in PATH: {path}")
        return path

    # Already downloaded locally?
    local_exe = FFMPEG_DIR / "ffmpeg.exe"
    if local_exe.exists():
        print(f" ffmpeg already exists at {local_exe}")
        return str(local_exe)

    print("\n Downloading ffmpeg (essentials build)...")
    zip_path = Path("ffmpeg-essentials.zip")
    download_file(FFMPEG_URL, zip_path, "ffmpeg")

    print("\n Extracting ffmpeg...")
    FFMPEG_DIR.mkdir(exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.namelist():
            # Only extract the bin/*.exe files
            if member.endswith(".exe"):
                filename = Path(member).name
                target = FFMPEG_DIR / filename
                with zf.open(member) as src, open(target, "wb") as dst:
                    dst.write(src.read())
                print(f"   {filename}")

    zip_path.unlink()

    if local_exe.exists():
        print(f"\n ffmpeg ready at: {local_exe}")
        return str(local_exe)

    print(" ERROR: ffmpeg.exe not found after extraction")
    return None

def setup_model() -> bool:
    """Download whisper model. Returns True on success."""
    MODELS_DIR.mkdir(exist_ok=True)
    model_path = MODELS_DIR / MODEL_NAME

    if model_path.exists() and model_path.stat().st_size > 100_000_000:
        print(f" Model already exists: {model_path} ({model_path.stat().st_size / 1_048_576:.0f} MB)")
        return True

    download_file(MODEL_URL, model_path, f"Whisper model ({MODEL_NAME})")

    if model_path.exists() and model_path.stat().st_size > 100_000_000:
        return True

    print(" ERROR: Model file too small or missing after download")
    return False

def main():
    what = sys.argv[1] if len(sys.argv) > 1 else "all"

    if what in ("all", "ffmpeg"):
        print("=" * 60)
        print(" Setting up ffmpeg")
        print("=" * 60)
        ffmpeg_path = setup_ffmpeg()
        if ffmpeg_path:
            Path(".ffmpeg_bin_path").write_text(ffmpeg_path)
        else:
            print("\nFAILED to set up ffmpeg")
            if what == "ffmpeg":
                sys.exit(1)

    if what in ("all", "whisper"):
        print("=" * 60)
        print(" Setting up whisper.cpp")
        print("=" * 60)
        exe_path = setup_whisper_bin()
        if exe_path:
            # Write path to temp file so run.bat can read it
            Path(".whisper_bin_path").write_text(exe_path)
        else:
            print("\nFAILED to set up whisper.cpp")
            if what == "whisper":
                sys.exit(1)

    if what in ("all", "model"):
        print()
        print("=" * 60)
        print(f" Downloading Whisper model: {MODEL_NAME}")
        print("=" * 60)
        if not setup_model():
            print("\nFAILED to download model")
            sys.exit(1)

    print()
    print("Setup complete!")


if __name__ == "__main__":
    main()
summarize.py
@@ -1,192 +1,192 @@
"""
Generate summaries from transcripts using Claude Code.
Reads manifest.json, processes each transcript, outputs per-lecture summaries,
and compiles SUPORT_CURS.md master study guide.

Usage:
    python summarize.py            # Print prompts for each transcript (pipe to Claude)
    python summarize.py --compile  # Compile existing summaries into SUPORT_CURS.md
"""

import json
import sys
import textwrap
from pathlib import Path

MANIFEST_PATH = Path("manifest.json")
SUMMARIES_DIR = Path("summaries")
TRANSCRIPTS_DIR = Path("transcripts")
MASTER_GUIDE = Path("SUPORT_CURS.md")

MAX_WORDS_PER_CHUNK = 10000
OVERLAP_WORDS = 500

SUMMARY_PROMPT = """Rezuma aceasta transcriere a unei lectii din cursul NLP Master Practitioner.

Ofera:
1. **Prezentare generala** - 3-5 propozitii care descriu subiectul principal al lectiei
2. **Concepte cheie** - lista cu definitii scurte pentru fiecare concept important
3. **Detalii si exemple importante** - informatii concrete, exercitii practice, exemple relevante mentionate de trainer
4. **Citate memorabile** - fraze sau idei remarcabile (daca exista)

Raspunde in limba romana. Formateaza ca Markdown.

---
TITLU LECTIE: {title}
---
TRANSCRIERE:
{text}
"""

MERGE_PROMPT = """Am mai multe rezumate partiale ale aceleiasi lectii (a fost prea lunga pentru un singur rezumat).
Combina-le intr-un singur rezumat coerent, eliminand duplicatele.

Pastreaza structura:
1. Prezentare generala (3-5 propozitii)
2. Concepte cheie cu definitii
3. Detalii si exemple importante
4. Citate memorabile

Raspunde in limba romana. Formateaza ca Markdown.

---
TITLU LECTIE: {title}
---
REZUMATE PARTIALE:
{chunks}
"""

def load_manifest() -> dict:
    with open(MANIFEST_PATH, encoding="utf-8") as f:
        return json.load(f)

def split_at_sentences(text: str, max_words: int, overlap: int) -> list[str]:
    """Split text into chunks at sentence boundaries with overlap."""
    words = text.split()
    if len(words) <= max_words:
        return [text]

    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunk_words = words[start:end]
        chunk_text = " ".join(chunk_words)

        # Try to break at sentence boundary (look back from end)
        if end < len(words):
            for sep in [". ", "! ", "? ", ".\n", "!\n", "?\n"]:
                last_sep = chunk_text.rfind(sep)
                if last_sep > len(chunk_text) // 2:  # Don't break too early
                    chunk_text = chunk_text[:last_sep + 1]
                    # Recalculate end based on actual words used
                    end = start + len(chunk_text.split())
                    break

        chunks.append(chunk_text)
        if end >= len(words):
            break  # reached the end; avoid emitting overlap-only tail chunks
        start = max(end - overlap, start + 1)  # Overlap, but always advance

    return chunks

def generate_prompts(manifest: dict):
|
||||
"""Print summary prompts for each transcript to stdout."""
|
||||
SUMMARIES_DIR.mkdir(exist_ok=True)
|
||||
|
||||
for mod in manifest["modules"]:
|
||||
for lec in mod["lectures"]:
|
||||
if lec.get("transcribe_status") != "complete":
|
||||
continue
|
||||
|
||||
summary_path = Path(lec["summary_path"])
|
||||
if summary_path.exists() and summary_path.stat().st_size > 0:
|
||||
print(f"# SKIP (exists): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
txt_path = Path(lec["transcript_path"])
|
||||
if not txt_path.exists():
|
||||
print(f"# SKIP (no transcript): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
text = txt_path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
print(f"# SKIP (empty): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
chunks = split_at_sentences(text, MAX_WORDS_PER_CHUNK, OVERLAP_WORDS)
|
||||
|
||||
print(f"\n{'='*60}", file=sys.stderr)
|
||||
print(f"Lecture: {lec['title']}", file=sys.stderr)
|
||||
print(f"Words: {len(text.split())}, Chunks: {len(chunks)}", file=sys.stderr)
|
||||
print(f"Output: {summary_path}", file=sys.stderr)
|
||||
|
||||
if len(chunks) == 1:
|
||||
prompt = SUMMARY_PROMPT.format(title=lec["title"], text=text)
|
||||
print(f"SUMMARY_FILE:{summary_path}")
|
||||
print(prompt)
|
||||
print("---END_PROMPT---")
|
||||
else:
|
||||
# Multi-chunk: generate individual chunk prompts
|
||||
for i, chunk in enumerate(chunks, 1):
|
||||
prompt = SUMMARY_PROMPT.format(
|
||||
title=f"{lec['title']} (partea {i}/{len(chunks)})",
|
||||
text=chunk,
|
||||
)
|
||||
print(f"CHUNK_PROMPT:{i}/{len(chunks)}:{summary_path}")
|
||||
print(prompt)
|
||||
print("---END_PROMPT---")
|
||||
|
||||
# Then a merge prompt
|
||||
print(f"MERGE_FILE:{summary_path}")
|
||||
merge = MERGE_PROMPT.format(
|
||||
title=lec["title"],
|
||||
chunks="{chunk_summaries}", # Placeholder for merge step
|
||||
)
|
||||
print(merge)
|
||||
print("---END_PROMPT---")
|
||||
|
||||
|
||||
def compile_master_guide(manifest: dict):
|
||||
"""Compile all summaries into SUPORT_CURS.md."""
|
||||
lines = [
|
||||
"# SUPORT CURS - NLP Master Practitioner Bucuresti 2025\n",
|
||||
"_Generat automat din transcrierile audio ale cursului._\n",
|
||||
"---\n",
|
||||
]
|
||||
|
||||
for mod in manifest["modules"]:
|
||||
lines.append(f"\n## {mod['name']}\n")
|
||||
|
||||
for lec in mod["lectures"]:
|
||||
summary_path = Path(lec["summary_path"])
|
||||
lines.append(f"\n### {lec['title']}\n")
|
||||
|
||||
if summary_path.exists():
|
||||
content = summary_path.read_text(encoding="utf-8").strip()
|
||||
lines.append(f"{content}\n")
|
||||
else:
|
||||
lines.append("_Rezumat indisponibil._\n")
|
||||
|
||||
lines.append("\n---\n")
|
||||
|
||||
MASTER_GUIDE.write_text("\n".join(lines), encoding="utf-8")
|
||||
print(f"Compiled {MASTER_GUIDE} ({MASTER_GUIDE.stat().st_size} bytes)")
|
||||
|
||||
|
||||
def main():
|
||||
if not MANIFEST_PATH.exists():
|
||||
print("manifest.json not found. Run download.py and transcribe.py first.")
|
||||
sys.exit(1)
|
||||
|
||||
manifest = load_manifest()
|
||||
|
||||
if "--compile" in sys.argv:
|
||||
compile_master_guide(manifest)
|
||||
else:
|
||||
generate_prompts(manifest)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
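A quick sanity check on the chunking constants in summarize.py: after the first chunk, each step covers MAX_WORDS_PER_CHUNK - OVERLAP_WORDS new words, so the chunk count is easy to estimate. The sketch below is illustrative (`approx_chunks` is not part of the pipeline, and it ignores the sentence-boundary adjustment, which can shift chunk edges slightly):

```python
import math

# Estimate how many chunks the splitter will produce: the first chunk
# covers max_words; every later chunk advances by max_words - overlap.
def approx_chunks(total_words: int, max_words: int = 10000, overlap: int = 500) -> int:
    if total_words <= max_words:
        return 1
    step = max_words - overlap
    return math.ceil((total_words - overlap) / step)

# A ~95-minute lecture runs on the order of 12-15k spoken words, so most
# transcripts land in one or two chunks:
print(approx_chunks(13_000))  # -> 2
print(approx_chunks(25_000))  # -> 3
```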
"""
|
||||
Generate summaries from transcripts using Claude Code.
|
||||
Reads manifest.json, processes each transcript, outputs per-lecture summaries,
|
||||
and compiles SUPORT_CURS.md master study guide.
|
||||
|
||||
Usage:
|
||||
python summarize.py # Print prompts for each transcript (pipe to Claude)
|
||||
python summarize.py --compile # Compile existing summaries into SUPORT_CURS.md
|
||||
"""
|
||||
|
||||
import json
|
||||
import sys
|
||||
import textwrap
|
||||
from pathlib import Path
|
||||
|
||||
MANIFEST_PATH = Path("manifest.json")
|
||||
SUMMARIES_DIR = Path("summaries")
|
||||
TRANSCRIPTS_DIR = Path("transcripts")
|
||||
MASTER_GUIDE = Path("SUPORT_CURS.md")
|
||||
|
||||
MAX_WORDS_PER_CHUNK = 10000
|
||||
OVERLAP_WORDS = 500
|
||||
|
||||
SUMMARY_PROMPT = """Rezuma aceasta transcriere a unei lectii din cursul NLP Master Practitioner.
|
||||
|
||||
Ofera:
|
||||
1. **Prezentare generala** - 3-5 propozitii care descriu subiectul principal al lectiei
|
||||
2. **Concepte cheie** - lista cu definitii scurte pentru fiecare concept important
|
||||
3. **Detalii si exemple importante** - informatii concrete, exercitii practice, exemple relevante mentionate de trainer
|
||||
4. **Citate memorabile** - fraze sau idei remarcabile (daca exista)
|
||||
|
||||
Raspunde in limba romana. Formateaza ca Markdown.
|
||||
|
||||
---
|
||||
TITLU LECTIE: {title}
|
||||
---
|
||||
TRANSCRIERE:
|
||||
{text}
|
||||
"""
|
||||
|
||||
MERGE_PROMPT = """Am mai multe rezumate partiale ale aceleiasi lectii (a fost prea lunga pentru un singur rezumat).
|
||||
Combina-le intr-un singur rezumat coerent, eliminand duplicatele.
|
||||
|
||||
Pastreaza structura:
|
||||
1. Prezentare generala (3-5 propozitii)
|
||||
2. Concepte cheie cu definitii
|
||||
3. Detalii si exemple importante
|
||||
4. Citate memorabile
|
||||
|
||||
Raspunde in limba romana. Formateaza ca Markdown.
|
||||
|
||||
---
|
||||
TITLU LECTIE: {title}
|
||||
---
|
||||
REZUMATE PARTIALE:
|
||||
{chunks}
|
||||
"""
|
||||
|
||||
|
||||
def load_manifest() -> dict:
|
||||
with open(MANIFEST_PATH, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def split_at_sentences(text: str, max_words: int, overlap: int) -> list[str]:
|
||||
"""Split text into chunks at sentence boundaries with overlap."""
|
||||
words = text.split()
|
||||
if len(words) <= max_words:
|
||||
return [text]
|
||||
|
||||
chunks = []
|
||||
start = 0
|
||||
while start < len(words):
|
||||
end = min(start + max_words, len(words))
|
||||
chunk_words = words[start:end]
|
||||
chunk_text = " ".join(chunk_words)
|
||||
|
||||
# Try to break at sentence boundary (look back from end)
|
||||
if end < len(words):
|
||||
for sep in [". ", "! ", "? ", ".\n", "!\n", "?\n"]:
|
||||
last_sep = chunk_text.rfind(sep)
|
||||
if last_sep > len(chunk_text) // 2: # Don't break too early
|
||||
chunk_text = chunk_text[:last_sep + 1]
|
||||
# Recalculate end based on actual words used
|
||||
end = start + len(chunk_text.split())
|
||||
break
|
||||
|
||||
chunks.append(chunk_text)
|
||||
start = max(end - overlap, start + 1) # Overlap, but always advance
|
||||
|
||||
return chunks
|
||||
|
||||
|
||||
def generate_prompts(manifest: dict):
|
||||
"""Print summary prompts for each transcript to stdout."""
|
||||
SUMMARIES_DIR.mkdir(exist_ok=True)
|
||||
|
||||
for mod in manifest["modules"]:
|
||||
for lec in mod["lectures"]:
|
||||
if lec.get("transcribe_status") != "complete":
|
||||
continue
|
||||
|
||||
summary_path = Path(lec["summary_path"])
|
||||
if summary_path.exists() and summary_path.stat().st_size > 0:
|
||||
print(f"# SKIP (exists): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
txt_path = Path(lec["transcript_path"])
|
||||
if not txt_path.exists():
|
||||
print(f"# SKIP (no transcript): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
text = txt_path.read_text(encoding="utf-8").strip()
|
||||
if not text:
|
||||
print(f"# SKIP (empty): {lec['title']}", file=sys.stderr)
|
||||
continue
|
||||
|
||||
chunks = split_at_sentences(text, MAX_WORDS_PER_CHUNK, OVERLAP_WORDS)
|
||||
|
||||
print(f"\n{'='*60}", file=sys.stderr)
|
||||
print(f"Lecture: {lec['title']}", file=sys.stderr)
|
||||
print(f"Words: {len(text.split())}, Chunks: {len(chunks)}", file=sys.stderr)
|
||||
print(f"Output: {summary_path}", file=sys.stderr)
|
||||
|
||||
if len(chunks) == 1:
|
||||
prompt = SUMMARY_PROMPT.format(title=lec["title"], text=text)
|
||||
print(f"SUMMARY_FILE:{summary_path}")
|
||||
print(prompt)
|
||||
print("---END_PROMPT---")
|
||||
else:
|
||||
# Multi-chunk: generate individual chunk prompts
|
||||
for i, chunk in enumerate(chunks, 1):
|
||||
prompt = SUMMARY_PROMPT.format(
|
||||
title=f"{lec['title']} (partea {i}/{len(chunks)})",
|
||||
text=chunk,
|
||||
)
|
||||
print(f"CHUNK_PROMPT:{i}/{len(chunks)}:{summary_path}")
|
||||
print(prompt)
|
||||
print("---END_PROMPT---")
|
||||
|
||||
# Then a merge prompt
|
||||
print(f"MERGE_FILE:{summary_path}")
|
||||
merge = MERGE_PROMPT.format(
|
||||
title=lec["title"],
|
||||
chunks="{chunk_summaries}", # Placeholder for merge step
|
||||
)
|
||||
print(merge)
|
||||
print("---END_PROMPT---")
|
||||
|
||||
|
||||
def compile_master_guide(manifest: dict):
|
||||
"""Compile all summaries into SUPORT_CURS.md."""
|
||||
lines = [
|
||||
"# SUPORT CURS - NLP Master Practitioner Bucuresti 2025\n",
|
||||
"_Generat automat din transcrierile audio ale cursului._\n",
|
||||
"---\n",
|
||||
]
|
||||
|
||||
for mod in manifest["modules"]:
|
||||
lines.append(f"\n## {mod['name']}\n")
|
||||
|
||||
for lec in mod["lectures"]:
|
||||
summary_path = Path(lec["summary_path"])
|
||||
lines.append(f"\n### {lec['title']}\n")
|
||||
|
||||
if summary_path.exists():
|
||||
content = summary_path.read_text(encoding="utf-8").strip()
|
||||
lines.append(f"{content}\n")
|
||||
else:
|
||||
lines.append("_Rezumat indisponibil._\n")
|
||||
|
||||
lines.append("\n---\n")
|
||||
|
||||
MASTER_GUIDE.write_text("\n".join(lines), encoding="utf-8")
|
||||
print(f"Compiled {MASTER_GUIDE} ({MASTER_GUIDE.stat().st_size} bytes)")
|
||||
|
||||
|
||||
def main():
|
||||
if not MANIFEST_PATH.exists():
|
||||
print("manifest.json not found. Run download.py and transcribe.py first.")
|
||||
sys.exit(1)
|
||||
|
||||
manifest = load_manifest()
|
||||
|
||||
if "--compile" in sys.argv:
|
||||
compile_master_guide(manifest)
|
||||
else:
|
||||
generate_prompts(manifest)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
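The marker protocol that generate_prompts emits (a `SUMMARY_FILE:`, `CHUNK_PROMPT:`, or `MERGE_FILE:` header, then the prompt body, then `---END_PROMPT---`) is meant to be consumed by a driver that feeds each prompt to Claude and writes the reply to the target path. A minimal sketch of the parsing side, under the protocol above (`parse_prompt_stream` is a hypothetical helper, not part of summarize.py):

```python
def parse_prompt_stream(stream: str) -> list[tuple[str, str, str]]:
    """Split summarize.py's stdout into (kind, target_path, prompt) tuples."""
    blocks = []
    for raw in stream.split("---END_PROMPT---"):
        raw = raw.strip()
        if not raw:
            continue
        header, _, prompt = raw.partition("\n")
        if header.startswith("SUMMARY_FILE:"):
            blocks.append(("summary", header.split(":", 1)[1], prompt))
        elif header.startswith("CHUNK_PROMPT:"):
            # Header shape: CHUNK_PROMPT:<i>/<n>:<path>
            _, _, rest = header.partition(":")
            _, _, path = rest.partition(":")
            blocks.append(("chunk", path, prompt))
        elif header.startswith("MERGE_FILE:"):
            blocks.append(("merge", header.split(":", 1)[1], prompt))
    return blocks

demo = (
    "SUMMARY_FILE:summaries/m1_l1.md\nRezuma...\n---END_PROMPT---\n"
    "CHUNK_PROMPT:1/2:summaries/m1_l2.md\nRezuma partea 1...\n---END_PROMPT---\n"
)
print(parse_prompt_stream(demo))
```

Splitting on the terminator first, then peeling one header line per block, keeps the parser robust to blank lines and to colons inside the prompt text itself.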
transcribe.py
"""
Batch transcription using whisper.cpp.
Reads manifest.json, transcribes each audio file in module order,
outputs .txt and .srt files, updates manifest status.
Resumable: skips files with existing transcripts.
Converts MP3 -> WAV (16kHz mono) via ffmpeg before transcription.
"""

import json
import logging
import os
import shutil
import subprocess
import sys
from pathlib import Path

MANIFEST_PATH = Path("manifest.json")
TRANSCRIPTS_DIR = Path("transcripts")
WAV_CACHE_DIR = Path("audio_wav")

# whisper.cpp defaults; override with env vars or CLI args
WHISPER_BIN = os.getenv("WHISPER_BIN", r"whisper-cli.exe")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", r"models\ggml-medium-q5_0.bin")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("transcribe_errors.log"),
    ],
)
log = logging.getLogger(__name__)


def find_ffmpeg() -> str:
    """Find ffmpeg executable."""
    if shutil.which("ffmpeg"):
        return "ffmpeg"
    # Check local directories
    for p in [Path("ffmpeg.exe"), Path("ffmpeg-bin/ffmpeg.exe")]:
        if p.exists():
            return str(p.resolve())
    # Try imageio-ffmpeg (pip fallback)
    try:
        import imageio_ffmpeg
        return imageio_ffmpeg.get_ffmpeg_exe()
    except ImportError:
        pass
    return ""


def convert_to_wav(audio_path: str) -> str:
    """
    Convert audio file to WAV 16kHz mono (optimal for whisper.cpp).
    Returns path to WAV file. Skips if WAV already exists.
    """
    src = Path(audio_path)

    # Already a WAV file, skip
    if src.suffix.lower() == ".wav":
        return audio_path

    WAV_CACHE_DIR.mkdir(exist_ok=True)
    wav_path = WAV_CACHE_DIR / (src.stem + ".wav")

    # Skip if already converted
    if wav_path.exists() and wav_path.stat().st_size > 0:
        log.info(f"  WAV cache hit: {wav_path}")
        return str(wav_path)

    ffmpeg = find_ffmpeg()
    if not ffmpeg:
        log.warning("  ffmpeg not found, using original file (may cause bad transcription)")
        return audio_path

    log.info(f"  Converting to WAV: {src.name} -> {wav_path.name}")
    cmd = [
        ffmpeg,
        "-i", audio_path,
        "-vn",                   # no video
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16kHz sample rate (whisper standard)
        "-ac", "1",              # mono
        "-y",                    # overwrite
        str(wav_path),
    ]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=300,  # 5 min max for conversion
        )
        if result.returncode != 0:
            log.error(f"  ffmpeg failed: {result.stderr[:300]}")
            return audio_path

        log.info(f"  WAV ready: {wav_path.name} ({wav_path.stat().st_size / 1_048_576:.0f} MB)")
        return str(wav_path)

    except FileNotFoundError:
        log.warning(f"  ffmpeg not found at: {ffmpeg}")
        return audio_path
    except subprocess.TimeoutExpired:
        log.error(f"  ffmpeg conversion timeout for {audio_path}")
        return audio_path


def load_manifest() -> dict:
    with open(MANIFEST_PATH, encoding="utf-8") as f:
        return json.load(f)


def save_manifest(manifest: dict):
    with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, ensure_ascii=False)


def transcribe_file(audio_path: str, output_base: str) -> bool:
    """
    Run whisper.cpp on a single file.
    Returns True on success.
    """
    cmd = [
        WHISPER_BIN,
        "--model", WHISPER_MODEL,
        "--language", "ro",
        "--no-gpu",
        "--threads", str(os.cpu_count() or 4),
        "--beam-size", "1",
        "--best-of", "1",
        "--output-txt",
        "--output-srt",
        "--output-file", output_base,
        "--file", audio_path,
    ]

    log.info(f"  CMD: {' '.join(cmd)}")

    try:
        # Add whisper.exe's directory to PATH so Windows finds its DLLs
        env = os.environ.copy()
        whisper_dir = str(Path(WHISPER_BIN).resolve().parent)
        env["PATH"] = whisper_dir + os.pathsep + env.get("PATH", "")

        result = subprocess.run(
            cmd,
            stdout=sys.stdout,
            stderr=sys.stderr,
            timeout=7200,  # 2 hour timeout per file
            env=env,
        )

        if result.returncode != 0:
            log.error(f"  whisper.cpp failed (exit {result.returncode})")
            return False

        # Verify output exists and is non-empty
        txt_path = Path(f"{output_base}.txt")
        srt_path = Path(f"{output_base}.srt")

        if not txt_path.exists() or txt_path.stat().st_size == 0:
            log.error(f"  Empty or missing transcript: {txt_path}")
            return False

        log.info(f"  Output: {txt_path.name} ({txt_path.stat().st_size} bytes)")
        if srt_path.exists():
            log.info(f"  Output: {srt_path.name} ({srt_path.stat().st_size} bytes)")

        return True

    except subprocess.TimeoutExpired:
        log.error(f"  Timeout (>2h) for {audio_path}")
        return False
    except FileNotFoundError:
        log.error(f"  whisper.cpp not found at: {WHISPER_BIN}")
        log.error("  Set WHISPER_BIN env var or put whisper-cli.exe in PATH")
        return False
    except Exception as e:
        log.error(f"  Error: {e}")
        return False


def parse_module_filter(arg: str) -> set[int]:
    """Parse module filter like '1-3' or '4,5' or '1-3,5' into a set of 1-based indices."""
    result = set()
    for part in arg.split(","):
        part = part.strip()
        if "-" in part:
            a, b = part.split("-", 1)
            result.update(range(int(a), int(b) + 1))
        else:
            result.add(int(part))
    return result


def main():
    if not MANIFEST_PATH.exists():
        log.error("manifest.json not found. Run download.py first.")
        sys.exit(1)

    # Parse --modules filter
    module_filter = None
    if "--modules" in sys.argv:
        idx = sys.argv.index("--modules")
        if idx + 1 < len(sys.argv):
            module_filter = parse_module_filter(sys.argv[idx + 1])
            log.info(f"Module filter: {sorted(module_filter)}")

    manifest = load_manifest()
    TRANSCRIPTS_DIR.mkdir(exist_ok=True)

    total = 0
    transcribed = 0
    skipped = 0
    failed = 0

    for mod_idx, mod in enumerate(manifest["modules"], 1):
        if module_filter and mod_idx not in module_filter:
            log.info(f"\nSkipping module {mod_idx}: {mod['name']}")
            continue
        log.info(f"\n{'='*60}")
        log.info(f"Module: {mod['name']}")
        log.info(f"{'='*60}")

        for lec in mod["lectures"]:
            total += 1

            if lec.get("download_status") != "complete":
                log.warning(f"  Skipping (not downloaded): {lec['title']}")
                continue

            audio_path = lec["audio_path"]
            stem = Path(lec["original_filename"]).stem.replace(" [Audio]", "")
            output_base = str(TRANSCRIPTS_DIR / stem)

            # Check if already transcribed
            txt_path = Path(f"{output_base}.txt")
            if txt_path.exists() and txt_path.stat().st_size > 0:
                lec["transcribe_status"] = "complete"
                skipped += 1
                log.info(f"  Skipping (exists): {stem}.txt")
                continue

            log.info(f"  Transcribing: {lec['title']}")
            log.info(f"  File: {audio_path}")

            # Convert MP3 -> WAV 16kHz mono for reliable whisper.cpp input
            wav_path = convert_to_wav(audio_path)

            if transcribe_file(wav_path, output_base):
                lec["transcribe_status"] = "complete"
                transcribed += 1
            else:
                lec["transcribe_status"] = "failed"
                failed += 1

            # Save manifest after each file (checkpoint)
            save_manifest(manifest)

        # Quality gate: pause after first module
        if mod == manifest["modules"][0] and transcribed > 0:
            log.info("\n" + "!" * 60)
            log.info("QUALITY GATE: First module complete.")
            log.info("Spot-check 2-3 transcripts in transcripts/ before continuing.")
            log.info("Press Enter to continue, or Ctrl+C to abort...")
            log.info("!" * 60)
            try:
                input()
            except EOFError:
                pass  # Non-interactive mode, continue

    # Validation: lectures marked complete whose transcript file is missing
    empty_outputs = [
        lec["title"]
        for mod in manifest["modules"]
        for lec in mod["lectures"]
        if lec.get("transcribe_status") == "complete"
        and not Path(lec["transcript_path"]).exists()
    ]

    log.info("\n" + "=" * 60)
    log.info(f"Transcribed {transcribed}/{total} files, {skipped} skipped, {failed} failures.")
    log.info(f"No empty outputs: {'PASS' if not empty_outputs else 'FAIL'}")
    if empty_outputs:
        for t in empty_outputs:
            log.error(f"  Missing transcript: {t}")
    log.info("=" * 60)

    save_manifest(manifest)

    if failed:
        sys.exit(1)


if __name__ == "__main__":
    main()
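Because the WAV cache is pinned to 16 kHz mono 16-bit PCM, file sizes are predictable: 16,000 samples/s x 2 bytes = 32,000 bytes per second of audio, plus a ~44-byte header. That makes a cheap sanity check for truncated conversions (a sketch; `expected_wav_bytes` is illustrative and not part of transcribe.py):

```python
def expected_wav_bytes(duration_s: float, rate: int = 16000,
                       sample_bytes: int = 2, channels: int = 1) -> int:
    """Predicted PCM payload size of a WAV file (excluding the ~44-byte header)."""
    return int(duration_s * rate * sample_bytes * channels)

# A ~95-minute lecture should convert to roughly 174 MB of 16kHz mono PCM;
# a cache file far below this suggests ffmpeg was interrupted.
mb = expected_wav_bytes(95 * 60) / 1_048_576
print(f"{mb:.0f} MB")  # -> 174 MB
```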
"""
|
||||
Batch transcription using whisper.cpp.
|
||||
Reads manifest.json, transcribes each audio file in module order,
|
||||
outputs .txt and .srt files, updates manifest status.
|
||||
Resumable: skips files with existing transcripts.
|
||||
Converts MP3 -> WAV (16kHz mono) via ffmpeg before transcription.
|
||||
"""
|
||||
|
||||
import json
|
||||
import logging
|
||||
import os
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
MANIFEST_PATH = Path("manifest.json")
|
||||
TRANSCRIPTS_DIR = Path("transcripts")
|
||||
WAV_CACHE_DIR = Path("audio_wav")
|
||||
|
||||
# whisper.cpp defaults — override with env vars or CLI args
|
||||
WHISPER_BIN = os.getenv("WHISPER_BIN", r"whisper-cli.exe")
|
||||
WHISPER_MODEL = os.getenv("WHISPER_MODEL", r"models\ggml-medium-q5_0.bin")
|
||||
|
||||
logging.basicConfig(
|
||||
level=logging.INFO,
|
||||
format="%(asctime)s [%(levelname)s] %(message)s",
|
||||
handlers=[
|
||||
logging.StreamHandler(),
|
||||
logging.FileHandler("transcribe_errors.log"),
|
||||
],
|
||||
)
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def find_ffmpeg() -> str:
|
||||
"""Find ffmpeg executable."""
|
||||
if shutil.which("ffmpeg"):
|
||||
return "ffmpeg"
|
||||
# Check local directories
|
||||
for p in [Path("ffmpeg.exe"), Path("ffmpeg-bin/ffmpeg.exe")]:
|
||||
if p.exists():
|
||||
return str(p.resolve())
|
||||
# Try imageio-ffmpeg (pip fallback)
|
||||
try:
|
||||
import imageio_ffmpeg
|
||||
return imageio_ffmpeg.get_ffmpeg_exe()
|
||||
except ImportError:
|
||||
pass
|
||||
return ""
|
||||
|
||||
|
||||
def convert_to_wav(audio_path: str) -> str:
|
||||
"""
|
||||
Convert audio file to WAV 16kHz mono (optimal for whisper.cpp).
|
||||
Returns path to WAV file. Skips if WAV already exists.
|
||||
"""
|
||||
src = Path(audio_path)
|
||||
|
||||
# Already a WAV file, skip
|
||||
if src.suffix.lower() == ".wav":
|
||||
return audio_path
|
||||
|
||||
WAV_CACHE_DIR.mkdir(exist_ok=True)
|
||||
wav_path = WAV_CACHE_DIR / (src.stem + ".wav")
|
||||
|
||||
# Skip if already converted
|
||||
if wav_path.exists() and wav_path.stat().st_size > 0:
|
||||
log.info(f" WAV cache hit: {wav_path}")
|
||||
return str(wav_path)
|
||||
|
||||
ffmpeg = find_ffmpeg()
|
||||
if not ffmpeg:
|
||||
log.warning(" ffmpeg not found, using original file (may cause bad transcription)")
|
||||
return audio_path
|
||||
|
||||
log.info(f" Converting to WAV: {src.name} -> {wav_path.name}")
|
||||
cmd = [
|
||||
ffmpeg,
|
||||
"-i", audio_path,
|
||||
"-vn", # no video
|
||||
"-acodec", "pcm_s16le", # 16-bit PCM
|
||||
"-ar", "16000", # 16kHz sample rate (whisper standard)
|
||||
"-ac", "1", # mono
|
||||
"-y", # overwrite
|
||||
str(wav_path),
|
||||
]
|
||||
|
||||
try:
|
||||
result = subprocess.run(
|
||||
cmd,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=300, # 5 min max for conversion
|
||||
)
|
||||
if result.returncode != 0:
|
||||
log.error(f" ffmpeg failed: {result.stderr[:300]}")
|
||||
return audio_path
|
||||
|
||||
log.info(f" WAV ready: {wav_path.name} ({wav_path.stat().st_size / 1_048_576:.0f} MB)")
|
||||
return str(wav_path)
|
||||
|
||||
except FileNotFoundError:
|
||||
log.warning(f" ffmpeg not found at: {ffmpeg}")
|
||||
return audio_path
|
||||
except subprocess.TimeoutExpired:
|
||||
log.error(f" ffmpeg conversion timeout for {audio_path}")
|
||||
return audio_path
|
||||
|
||||
|
||||
def load_manifest() -> dict:
|
||||
with open(MANIFEST_PATH, encoding="utf-8") as f:
|
||||
return json.load(f)
|
||||
|
||||
|
||||
def save_manifest(manifest: dict):
|
||||
with open(MANIFEST_PATH, "w", encoding="utf-8") as f:
|
||||
json.dump(manifest, f, indent=2, ensure_ascii=False)
|
||||
|
||||
|
||||
def transcribe_file(audio_path: str, output_base: str) -> bool:
|
||||
"""
|
||||
Run whisper.cpp on a single file.
|
||||
Returns True on success.
|
||||
"""
|
||||
cmd = [
|
||||
WHISPER_BIN,
|
||||
"--model", WHISPER_MODEL,
|
||||
"--language", "ro",
|
||||
"--no-gpu",
|
||||
"--threads", str(os.cpu_count() or 4),
|
||||
"--beam-size", "1",
|
||||
"--best-of", "1",
|
||||
"--output-txt",
|
||||
"--output-srt",
|
||||
"--output-file", output_base,
|
||||
"--file", audio_path,
|
||||
]
|
||||
|
||||
log.info(f" CMD: {' '.join(cmd)}")
|
||||
|
||||
try:
|
||||
# Add whisper.exe's directory to PATH so Windows finds its DLLs
|
||||
env = os.environ.copy()
|
||||
whisper_dir = str(Path(WHISPER_BIN).resolve().parent)
|
||||
env["PATH"] = whisper_dir + os.pathsep + env.get("PATH", "")
|
||||
|
||||
result = subprocess.run(
|
||||
cmd,
|
||||
stdout=sys.stdout,
|
||||
stderr=sys.stderr,
|
||||
timeout=7200, # 2 hour timeout per file
|
||||
env=env,
|
||||
)
|
||||
|
||||
if result.returncode != 0:
|
||||
log.error(f" whisper.cpp failed (exit {result.returncode})")
|
||||
return False
|
||||
|
||||
# Verify output exists and is non-empty
|
||||
txt_path = Path(f"{output_base}.txt")
|
||||
srt_path = Path(f"{output_base}.srt")
|
||||
|
||||
if not txt_path.exists() or txt_path.stat().st_size == 0:
|
||||
log.error(f" Empty or missing transcript: {txt_path}")
|
||||
return False
|
||||
|
||||
log.info(f" Output: {txt_path.name} ({txt_path.stat().st_size} bytes)")
|
||||
if srt_path.exists():
|
||||
log.info(f" Output: {srt_path.name} ({srt_path.stat().st_size} bytes)")
|
||||
|
||||
return True
|
||||
|
||||
except subprocess.TimeoutExpired:
|
||||
log.error(f" Timeout (>2h) for {audio_path}")
|
||||
return False
|
||||
except FileNotFoundError:
|
||||
log.error(f" whisper.cpp not found at: {WHISPER_BIN}")
|
||||
log.error(f" Set WHISPER_BIN env var or put whisper-cli.exe in PATH")
|
||||
return False
|
||||
except Exception as e:
|
||||
log.error(f" Error: {e}")
|
||||
return False
|
||||
|
||||
|
||||
def parse_module_filter(arg: str) -> set[int]:
|
||||
"""Parse module filter like '1-3' or '4,5' or '1-3,5' into a set of 1-based indices."""
|
||||
result = set()
|
||||
for part in arg.split(","):
|
||||
part = part.strip()
|
||||
if "-" in part:
|
||||
a, b = part.split("-", 1)
|
||||
result.update(range(int(a), int(b) + 1))
|
||||
else:
|
||||
result.add(int(part))
|
||||
return result
|
||||
|
||||
|
||||
def main():
|
||||
if not MANIFEST_PATH.exists():
|
||||
log.error("manifest.json not found. Run download.py first.")
|
||||
sys.exit(1)
|
||||
|
||||
# Parse --modules filter
|
||||
module_filter = None
|
||||
if "--modules" in sys.argv:
|
||||
idx = sys.argv.index("--modules")
|
||||
if idx + 1 < len(sys.argv):
|
||||
module_filter = parse_module_filter(sys.argv[idx + 1])
|
||||
log.info(f"Module filter: {sorted(module_filter)}")
|
||||
|
||||
manifest = load_manifest()
|
||||
TRANSCRIPTS_DIR.mkdir(exist_ok=True)
|
||||
|
||||
total = 0
|
||||
transcribed = 0
|
||||
skipped = 0
|
||||
failed = 0
|
||||
|
||||
for mod_idx, mod in enumerate(manifest["modules"], 1):
|
||||
if module_filter and mod_idx not in module_filter:
|
||||
log.info(f"\nSkipping module {mod_idx}: {mod['name']}")
|
||||
continue
|
||||
log.info(f"\n{'='*60}")
|
||||
log.info(f"Module: {mod['name']}")
|
||||
log.info(f"{'='*60}")
|
||||
|
||||
for lec in mod["lectures"]:
|
||||
total += 1
|
||||
|
||||
if lec.get("download_status") != "complete":
|
||||
log.warning(f" Skipping (not downloaded): {lec['title']}")
|
||||
continue
|
||||
|
||||
audio_path = lec["audio_path"]
|
||||
stem = Path(lec["original_filename"]).stem.replace(" [Audio]", "")
|
||||
output_base = str(TRANSCRIPTS_DIR / stem)
|
||||
|
||||
# Check if already transcribed
|
||||
txt_path = Path(f"{output_base}.txt")
|
||||
if txt_path.exists() and txt_path.stat().st_size > 0:
|
||||
lec["transcribe_status"] = "complete"
|
||||
skipped += 1
|
||||
log.info(f" Skipping (exists): {stem}.txt")
|
||||
continue
|
||||
|
||||
log.info(f" Transcribing: {lec['title']}")
|
||||
log.info(f" File: {audio_path}")
|
||||
|
||||
# Convert MP3 -> WAV 16kHz mono for reliable whisper.cpp input
|
||||
wav_path = convert_to_wav(audio_path)
|
||||
|
||||
if transcribe_file(wav_path, output_base):
|
||||
lec["transcribe_status"] = "complete"
|
||||
transcribed += 1
|
||||
else:
|
||||
lec["transcribe_status"] = "failed"
|
||||
failed += 1
|
||||
|
||||
# Save manifest after each file (checkpoint)
|
||||
save_manifest(manifest)
|
||||
|
||||
# Quality gate: pause after first module
|
||||
if mod == manifest["modules"][0] and transcribed > 0:
|
||||
log.info("\n" + "!" * 60)
|
||||
log.info("QUALITY GATE: First module complete.")
|
||||
log.info("Spot-check 2-3 transcripts in transcripts/ before continuing.")
|
||||
log.info("Press Enter to continue, or Ctrl+C to abort...")
|
||||
log.info("!" * 60)
|
||||
try:
|
||||
input()
|
||||
except EOFError:
|
||||
pass # Non-interactive mode, continue
|
||||
|
||||
# Validation
|
||||
empty_outputs = [
|
||||
lec["title"]
|
||||
for mod in manifest["modules"]
|
||||
for lec in mod["lectures"]
|
||||
if lec.get("transcribe_status") == "complete"
|
||||
and not Path(lec["transcript_path"]).exists()
|
||||
]
|
||||
|
||||
log.info("\n" + "=" * 60)
|
||||
log.info(f"Transcribed {transcribed}/{total} files, {skipped} skipped, {failed} failures.")
|
||||
log.info(f"No empty outputs: {'PASS' if not empty_outputs else 'FAIL'}")
|
||||
if empty_outputs:
|
||||
for t in empty_outputs:
|
||||
log.error(f" Missing transcript: {t}")
|
||||
log.info("=" * 60)
|
||||
|
||||
save_manifest(manifest)
|
||||
|
||||
if failed:
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||