- audio/Modul {N}/filename.mp3: each module in its own subdirectory,
for copying to a phone and transferring between PCs.
- PDFs are kept as source in summaries/pdf/ (no txt extract).
- transcribe_status="pdf_source_only" for PDF lessons → summarize.py
filters them out automatically.
- Fix manifest transcript_path collision (stem-based, not preserve-prior).
- One .bat per module (M2-M8) + dispatchers run_pc1_all (M2-M5) + run_pc2_all
(M6-M8) to split the work across 2 PCs.
- prepare_pc2_bundle.py: zip with scripts + manifest + .env + PDFs for
PC2 (self-installs whisper.cpp/model/ffmpeg on first run).
- M1 whisper complete (49/49 audio+vimeo transcribed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The full course (M1-M6) was completed in April 2026.
Architecture
Three-stage batch pipeline, all driven by a shared manifest.json:
- download.py: Logs into the course site (credentials from .env), scrapes the module/lecture structure, and downloads MP3s to audio/. Updates the manifest with download status. Resumable (skips existing files).
- transcribe.py: Reads the manifest, converts MP3→WAV (16kHz mono via ffmpeg), and runs whisper.cpp (whisper-cli.exe) for Romanian speech-to-text. Outputs .txt and .srt to transcripts/. Has a quality gate after the first module. Resumable. Supports a --modules 1-3 filter.
- summarize.py: Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). The --compile flag assembles all summaries into the SUPORT_CURS.md master study guide. Summaries are in Romanian.
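The "chunks long texts at sentence boundaries with overlap" step can be sketched as follows. This is a minimal illustration, not the code in summarize.py: the function name chunk_text, the max_chars and overlap_sentences parameters, and the naive punctuation-based sentence splitter are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 8000, overlap_sentences: int = 2) -> list[str]:
    """Split text into chunks that always end on a sentence boundary.

    Every chunk after the first starts with the last `overlap_sentences`
    sentences of the previous chunk, so context carries across chunks.
    A chunk can exceed `max_chars` when a single sentence (plus the
    overlap) is longer than the limit.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and seed the next one with the overlap.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping by whole sentences rather than by characters keeps each prompt readable on its own, at the cost of some duplicated text between neighbouring chunks.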
setup_whisper.py — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by run.bat.
run.bat — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
retranscribe_tail.py / fix_hallucinations.bat — Post-processing repair. Scans SRTs for Whisper hallucination bursts (repeated lines), classifies each as burst (in-file loop) or tail (runs to EOF), extracts the bad audio segment with ffmpeg, re-transcribes it with anti-hallucination whisper.cpp flags, and splices the result back. Supports --dry-run and per-file targeting. Run only after transcription produces visibly broken outputs.
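The burst/tail classification described above can be approximated as below. This is a sketch under stated assumptions, not the logic of retranscribe_tail.py: the names parse_srt_texts, find_repeats and Run are hypothetical, and the min_repeats threshold is an illustrative value rather than the one the script uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Run:
    start_cue: int  # 0-based index of the first repeated cue
    end_cue: int    # 0-based index of the last repeated cue (inclusive)
    text: str       # the repeated line
    kind: str       # "burst" (loop inside the file) or "tail" (runs to EOF)


def parse_srt_texts(srt: str) -> list[str]:
    """Return the text of each SRT cue, in order."""
    texts = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        # Cue layout: index line, timing line, then one or more text lines.
        if len(lines) >= 3 and "-->" in lines[1]:
            texts.append(" ".join(line.strip() for line in lines[2:]))
    return texts


def find_repeats(texts: list[str], min_repeats: int = 5) -> list[Run]:
    """Find runs of identical consecutive cues and classify each one."""
    runs = []
    i = 0
    while i < len(texts):
        j = i
        while j + 1 < len(texts) and texts[j + 1] == texts[i]:
            j += 1
        if j - i + 1 >= min_repeats:
            kind = "tail" if j == len(texts) - 1 else "burst"
            runs.append(Run(i, j, texts[i], kind))
        i = j + 1
    return runs
```

The timestamps of the first and last cue in each run would then bound the audio segment to extract and re-transcribe.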
Commands
# Full pipeline (Windows native)
run.bat # download + transcribe all modules
run.bat 4-5 # transcribe only modules 4-5
# Individual steps (from venv)
python download.py # download audio files
python transcribe.py # transcribe all
python transcribe.py --modules 1 # transcribe module 1 only
python summarize.py # print summary prompts to stdout
python summarize.py --compile # compile SUPORT_CURS.md from existing summaries
# Repair hallucinated transcripts
python retranscribe_tail.py --dry-run # report what would be fixed
python retranscribe_tail.py # auto-fix all broken transcripts
python retranscribe_tail.py "Master 25M1 Z2B" # fix a single file
fix_hallucinations.bat # same, via Windows wrapper
# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3 # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md # specific file
# Setup components individually
python setup_whisper.py whisper # download whisper.cpp binary
python setup_whisper.py model # download Whisper model
python setup_whisper.py ffmpeg # download ffmpeg
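
A module filter such as 4-5 or 1 could be parsed along these lines. This is an illustrative sketch: parse_modules is a hypothetical name, and the comma-separated form (1,3-4) is an assumption that the commands above do not show.

```python
def parse_modules(spec: str) -> list[int]:
    """Parse '4-5', '1' or '1,3-4' into a sorted list of module numbers."""
    modules: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            if lo > hi:
                raise ValueError(f"invalid range: {part!r}")
            modules.update(range(lo, hi + 1))
        else:
            modules.add(int(part))
    return sorted(modules)
```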
Key Design Decisions
- Platform split: Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
- Vulkan, not CUDA: Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
- Model: ggml-medium-q5_0.bin (quantized medium, fits in 8GB VRAM). Stored in models/.
- manifest.json is the shared state between all scripts: it tracks download/transcribe status per lecture. Checkpointed after each file.
- Resumability: All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
- Environment variables: COURSE_USERNAME and COURSE_PASSWORD in .env. Optional: WHISPER_BIN, WHISPER_MODEL to override paths.
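
The manifest checkpointing and resumability described above might look roughly like this. It is a sketch only: the "lectures" key, the "complete" status value, and the function names are assumptions about the manifest layout; only the transcribe_status field and its "pdf_source_only" value appear in this repository's notes.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable

# Statuses that need no further work (the "complete" value is an assumption).
SKIP_STATUSES = {"complete", "pdf_source_only"}


def save_manifest(path: Path, manifest: dict) -> None:
    """Write the manifest atomically so a crash cannot leave a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, ensure_ascii=False, indent=2)
    os.replace(tmp, path)


def process_pending(path: Path, work: Callable[[dict], None]) -> int:
    """Run `work` on every unfinished lecture, checkpointing after each one."""
    manifest = json.loads(path.read_text(encoding="utf-8"))
    done = 0
    for lecture in manifest["lectures"]:
        if lecture.get("transcribe_status") in SKIP_STATUSES:
            continue  # resumable: already handled in an earlier run
        work(lecture)
        lecture["transcribe_status"] = "complete"
        save_manifest(path, manifest)  # checkpoint after each file
        done += 1
    return done
```

Writing to a temporary file and renaming it over the manifest means an interrupted run leaves either the old or the new manifest on disk, never a truncated one.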
Dependencies
Python packages (in requirements.txt): requests, beautifulsoup4, python-dotenv
External tools (auto-installed by run.bat/setup_whisper.py):
- whisper.cpp (whisper-cli.exe) with Vulkan support
- ffmpeg (for MP3→WAV conversion)
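
For orientation, the two external tools could be invoked with command lines like the ones built below. This is a hedged sketch, not the exact invocation in transcribe.py: the helper names are hypothetical, the default binary name and model path are assumptions derived from the descriptions above, and the flag sets are a plausible minimum (ffmpeg's -ar, -ac and -c:a; whisper.cpp's -m, -f, -l, -otxt, -osrt and -of).

```python
import os


def ffmpeg_cmd(mp3_path: str, wav_path: str) -> list[str]:
    """Build an ffmpeg command that converts MP3 to 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite the output file without asking
        "-i", mp3_path,
        "-ar", "16000",       # sample rate: 16 kHz
        "-ac", "1",           # channels: mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        wav_path,
    ]


def whisper_cmd(wav_path: str, out_base: str) -> list[str]:
    """Build a whisper-cli command for Romanian speech-to-text with .txt and .srt output."""
    # WHISPER_BIN / WHISPER_MODEL override the defaults (default values are assumptions).
    binary = os.environ.get("WHISPER_BIN", "whisper-cli.exe")
    model = os.environ.get("WHISPER_MODEL", "models/ggml-medium-q5_0.bin")
    return [
        binary,
        "-m", model,
        "-f", wav_path,
        "-l", "ro",          # spoken language: Romanian
        "-otxt", "-osrt",    # write plain-text and SRT outputs
        "-of", out_base,     # output path without extension
    ]
```

Either list can be passed to subprocess.run(cmd, check=True) once the binaries are installed.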
Gstack Skills
For all web browsing tasks, use the /browse skill from gstack. Never use mcp__claude-in-chrome__* tools.
Available gstack skills:
/office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /design-shotgun, /design-html, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /connect-chrome, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codex, /cso, /autoplan, /plan-devex-review, /devex-review, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, /learn
Summarization Documentation
All summarization processes are documented in:
- PROCES_SUMARIZARE.md: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET) and how to use parallel agents
- PROMPT_EXPERIENTIAL.md: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications) with complete prompts
Outputs in summaries/:
- 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
- 6 modules × 2 experiential files (MODUL{N}{CONCEPT}.md + MODUL{N}{CONCEPT}_APLICARI.md)
- 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
- summaries/pdf/: all generated with md_to_pdf.py