nlp-master/CLAUDE.md
Marius Mutu 6ee53133b7 feat(practitioner): per-module structure + source PDFs + 2-PC split
- audio/Modul {N}/filename.mp3: each module lives in its own subdirectory,
  for copying to a phone and transferring between PCs.
- PDFs are kept as source in summaries/pdf/ (no txt extraction).
- transcribe_status="pdf_source_only" for PDF lessons → summarize.py
  filters them out automatically.
- Fix manifest transcript_path collision (stem-based, not preserve prior).
- One .bat per module (M2-M8) + dispatchers run_pc1_all (M2-M5) + run_pc2_all
  (M6-M8) to split the work across 2 PCs.
- prepare_pc2_bundle.py: zip with scripts + manifest + .env + PDFs for
  PC2 (self-installs whisper.cpp/model/ffmpeg on first run).
- M1 whisper complete (49/49 audio+vimeo transcribed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 08:48:58 +03:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The full course (M1-M6) was completed in April 2026.
## Architecture
Three-stage batch pipeline, all driven by a shared `manifest.json`:
1. **download.py** — Logs into the course site (credentials from `.env`), scrapes module/lecture structure, downloads MP3s to `audio/`. Updates manifest with download status. Resumable (skips existing files).
2. **transcribe.py** — Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (`whisper-cli.exe`) for Romanian speech-to-text. Outputs `.txt` and `.srt` to `transcripts/`. Has a quality gate after the first module. Resumable. Supports `--modules 1-3` filter.
3. **summarize.py** — Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). `--compile` flag assembles all summaries into `SUPORT_CURS.md` master study guide. Summaries are in Romanian.
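The chunking step in stage 3 can be sketched as follows. This is a minimal illustration, not the code in `summarize.py`: the function name `chunk_text`, the `max_chars` limit, and the one-sentence overlap are all assumptions made for the example.

```python
import re


def chunk_text(text: str, max_chars: int = 2000, overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking only between sentences.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one, so context carries across chunk boundaries.
    A chunk can exceed max_chars when the overlap plus one sentence is already
    longer than the limit (hypothetical policy: never split inside a sentence).
    """
    # Naive sentence splitter: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With a small limit, `chunk_text("A one. B two. C three. D four.", max_chars=14)` produces three chunks in which each chunk begins with the last sentence of the previous one.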
**setup_whisper.py** — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by `run.bat`.
**run.bat** — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
**retranscribe_tail.py** / **fix_hallucinations.bat** — Post-processing repair. Scans SRTs for Whisper hallucination bursts (repeated lines), classifies each as `burst` (in-file loop) or `tail` (runs to EOF), extracts the bad audio segment with ffmpeg, re-transcribes it with anti-hallucination whisper.cpp flags, and splices the result back. Supports `--dry-run` and per-file targeting. Run only after transcription produces visibly broken outputs.
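The `burst` versus `tail` classification described above can be illustrated with a short sketch. It is an assumption about how such detection might work, not the logic in `retranscribe_tail.py`: the helper names and the `min_repeats=4` threshold are hypothetical.

```python
import re
from itertools import groupby


def parse_srt_texts(srt: str) -> list[str]:
    """Return the text of each SRT cue, in order."""
    texts = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        # A cue is: index line, timing line, then one or more text lines.
        if len(lines) >= 3 and "-->" in lines[1]:
            texts.append(" ".join(line.strip() for line in lines[2:]))
    return texts


def find_repeats(texts: list[str], min_repeats: int = 4) -> list[tuple[str, int, int]]:
    """Find runs of identical consecutive cues.

    Returns (kind, start, end) tuples with an exclusive end index, where kind
    is "tail" if the run reaches the last cue and "burst" otherwise.
    """
    found = []
    pos = 0
    for key, group in groupby(texts, key=lambda t: t.strip().lower()):
        n = len(list(group))
        if key and n >= min_repeats:
            end = pos + n
            kind = "tail" if end == len(texts) else "burst"
            found.append((kind, pos, end))
        pos += n
    return found
```

The cue indices returned here could then be mapped back to SRT timestamps to decide which audio segment to extract and re-transcribe.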
## Commands
```bash
# Full pipeline (Windows native)
run.bat # download + transcribe all modules
run.bat 4-5 # transcribe only modules 4-5
# Individual steps (from venv)
python download.py # download audio files
python transcribe.py # transcribe all
python transcribe.py --modules 1 # transcribe module 1 only
python summarize.py # print summary prompts to stdout
python summarize.py --compile # compile SUPORT_CURS.md from existing summaries
# Repair hallucinated transcripts
python retranscribe_tail.py --dry-run # report what would be fixed
python retranscribe_tail.py # auto-fix all broken transcripts
python retranscribe_tail.py "Master 25M1 Z2B" # fix a single file
fix_hallucinations.bat # same, via Windows wrapper
# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3 # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md # specific file
# Setup components individually
python setup_whisper.py whisper # download whisper.cpp binary
python setup_whisper.py model # download Whisper model
python setup_whisper.py ffmpeg # download ffmpeg
```
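The module filter accepts values such as `1`, `1-3`, or `4-5`. A parser for that syntax might look like the sketch below; `parse_modules` is a hypothetical name, and the comma-separated form (`1,4-5`) is an assumption that the commands above do not show.

```python
def parse_modules(spec: str) -> set[int]:
    """Parse a module filter such as "1", "1-3" or "1,4-5" into a set of module numbers."""
    modules: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"invalid module range: {part!r}")
            # Ranges are inclusive on both ends: "1-3" means modules 1, 2 and 3.
            modules.update(range(low, high + 1))
        else:
            modules.add(int(part))
    return modules
```

A script could then skip any manifest entry whose module number is not in the returned set.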
## Key Design Decisions
- **Platform split:** Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
- **Vulkan, not CUDA:** Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
- **Model:** `ggml-medium-q5_0.bin` (quantized medium, fits in 8GB VRAM). Stored in `models/`.
- **manifest.json** is the shared state between all scripts — tracks download/transcribe status per lecture. Checkpointed after each file.
- **Resumability:** All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
- **Environment variables:** `COURSE_USERNAME` and `COURSE_PASSWORD` in `.env`. Optional: `WHISPER_BIN`, `WHISPER_MODEL` to override paths.
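The checkpoint-and-resume pattern behind `manifest.json` can be sketched as below. The real manifest schema is not shown in this file, so the `lectures` list, the `title` key, and the `"complete"` status value are assumptions; only the `transcribe_status` field name appears in the project's commit notes.

```python
import json
import os
import tempfile
from pathlib import Path


def save_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest atomically: temp file in the same directory, then replace."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(manifest, handle, ensure_ascii=False, indent=2)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if writing or replacing failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def process_pending(manifest: dict, path: Path, work) -> list[str]:
    """Run `work` on every lecture not yet complete, checkpointing after each one."""
    processed = []
    for lecture in manifest["lectures"]:  # hypothetical schema
        if lecture.get("transcribe_status") == "complete":
            continue  # resumability: skip finished work
        work(lecture)
        lecture["transcribe_status"] = "complete"
        save_manifest(manifest, path)  # checkpoint after each file
        processed.append(lecture["title"])
    return processed
```

Writing to a temporary file and swapping it in with `os.replace` means an interrupted run leaves either the old manifest or the new one on disk, never a half-written file.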
## Dependencies
Python packages (in requirements.txt): `requests`, `beautifulsoup4`, `python-dotenv`
External tools (auto-installed by run.bat/setup_whisper.py):
- whisper.cpp (whisper-cli.exe) with Vulkan support
- ffmpeg (for MP3→WAV conversion)
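For reference, the MP3→WAV step can be expressed as an ffmpeg argument list. This is a sketch of a typical invocation that yields 16 kHz mono 16-bit PCM, the format whisper.cpp reads; the exact command used by `transcribe.py` is not shown in this file, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def ffmpeg_wav_command(mp3: Path, wav: Path, ffmpeg: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg command that converts an MP3 to 16 kHz mono 16-bit PCM WAV."""
    return [
        ffmpeg,
        "-y",                 # overwrite the output file if it exists
        "-i", str(mp3),       # input file
        "-ar", "16000",       # sample rate: 16 kHz
        "-ac", "1",           # one channel (mono)
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(wav),
    ]


def convert_to_wav(mp3: Path, wav: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_wav_command(mp3, wav), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with lecture filenames that contain spaces.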
## Gstack Skills
For all web browsing tasks, use the `/browse` skill from gstack. Never use `mcp__claude-in-chrome__*` tools.
Available gstack skills:
`/office-hours`, `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/design-consultation`, `/design-shotgun`, `/design-html`, `/review`, `/ship`, `/land-and-deploy`, `/canary`, `/benchmark`, `/browse`, `/connect-chrome`, `/qa`, `/qa-only`, `/design-review`, `/setup-browser-cookies`, `/setup-deploy`, `/retro`, `/investigate`, `/document-release`, `/codex`, `/cso`, `/autoplan`, `/plan-devex-review`, `/devex-review`, `/careful`, `/freeze`, `/guard`, `/unfreeze`, `/gstack-upgrade`, `/learn`
## Summarization Documentation
All summarization processes are documented in:
- `PROCES_SUMARIZARE.md`: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET), plus how to use parallel agents
- `PROMPT_EXPERIENTIAL.md`: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications), with complete prompts
Outputs in `summaries/`:
- 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
- 6 modules × 2 experiential files (MODUL{N}_{CONCEPT}.md + MODUL{N}_{CONCEPT}_APLICARI.md)
- 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
- `summaries/pdf/`: all generated with `md_to_pdf.py`