# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The full course (M1-M6) was completed in April 2026.

## Architecture

Three-stage batch pipeline, all driven by a shared `manifest.json`:

1. **download.py** — Logs into the course site (credentials from `.env`), scrapes module/lecture structure, downloads MP3s to `audio/`. Updates manifest with download status. Resumable (skips existing files).

2. **transcribe.py** — Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (`whisper-cli.exe`) for Romanian speech-to-text. Outputs `.txt` and `.srt` to `transcripts/`. Has a quality gate after the first module. Resumable. Supports `--modules 1-3` filter.

3. **summarize.py** — Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). The `--compile` flag assembles all summaries into the `SUPORT_CURS.md` master study guide. Summaries are in Romanian.
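
The sentence-boundary chunking with overlap mentioned for `summarize.py` can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `chunk_text`, the `max_chars` budget, the `overlap_sentences` count, and the naive punctuation-based sentence splitter are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 8000, overlap_sentences: int = 2) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking only between sentences.

    Hypothetical helper: parameter names and defaults are illustrative.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in sentences:
        # Close the current chunk when this sentence would exceed the budget.
        if current and size + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences over so context spans the boundary.
            current = current[-overlap_sentences:] if overlap_sentences else []
            size = sum(len(s) + 1 for s in current)
        current.append(sentence)
        size += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_chars` still ends up in one oversized chunk, since the sketch never splits inside a sentence.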

**setup_whisper.py** — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by `run.bat`.

**run.bat** — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.

**retranscribe_tail.py** / **fix_hallucinations.bat** — Post-processing repair. Scans SRTs for Whisper hallucination bursts (repeated lines), classifies each as `burst` (in-file loop) or `tail` (runs to EOF), extracts the bad audio segment with ffmpeg, re-transcribes it with anti-hallucination whisper.cpp flags, and splices the result back. Supports `--dry-run` and per-file targeting. Run only after transcription produces visibly broken outputs.
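
To make the `burst` versus `tail` distinction concrete, here is a hedged sketch of how repeated-line runs could be detected in an SRT file. The names `Cue`, `parse_srt`, and `find_bursts`, the exact-match comparison, and the `min_repeats` threshold are assumptions for illustration; the real heuristics in `retranscribe_tail.py` may differ.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    start: str
    end: str
    text: str


TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into cues; blocks without a timing line are skipped."""
    cues: list[Cue] = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                text = " ".join(lines[i + 1:]).strip()
                cues.append(Cue(match.group(1), match.group(2), text))
                break
    return cues


def find_bursts(cues: list[Cue], min_repeats: int = 5) -> list[tuple[str, int, int]]:
    """Return (kind, first_index, last_index) for each run of identical cue text.

    A run that reaches the last cue is labeled 'tail', any other run 'burst'.
    The min_repeats threshold is an assumed value, not taken from the script.
    """
    found: list[tuple[str, int, int]] = []
    i = 0
    while i < len(cues):
        j = i
        # Extend the run while the next cue repeats the same text.
        while j + 1 < len(cues) and cues[j + 1].text == cues[i].text:
            j += 1
        if cues[i].text and j - i + 1 >= min_repeats:
            kind = "tail" if j == len(cues) - 1 else "burst"
            found.append((kind, i, j))
        i = j + 1
    return found
```

The `start` of the first cue and the `end` of the last cue in a run give the time range that would need to be cut out and re-transcribed.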

## Commands

```bash
# Full pipeline (Windows native)
run.bat              # download + transcribe all modules
run.bat 4-5          # transcribe only modules 4-5

# Individual steps (from venv)
python download.py                # download audio files
python transcribe.py              # transcribe all
python transcribe.py --modules 1  # transcribe module 1 only
python summarize.py               # print summary prompts to stdout
python summarize.py --compile     # compile SUPORT_CURS.md from existing summaries

# Repair hallucinated transcripts
python retranscribe_tail.py --dry-run              # report what would be fixed
python retranscribe_tail.py                        # auto-fix all broken transcripts
python retranscribe_tail.py "Master 25M1 Z2B"     # fix a single file
fix_hallucinations.bat                             # same, via Windows wrapper

# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py                    # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3      # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md     # specific file

# Setup components individually
python setup_whisper.py whisper   # download whisper.cpp binary
python setup_whisper.py model     # download Whisper model
python setup_whisper.py ffmpeg    # download ffmpeg
```
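
The module filter used above (`--modules 1`, `--modules 1-3`, `run.bat 4-5`) takes either a single number or a range. A minimal sketch of how such a spec might be parsed is shown below; the function name `parse_modules` and the comma-separated form are assumptions, and only the single-number and range forms appear in this file.

```python
def parse_modules(spec: str) -> list[int]:
    """Expand a spec such as '1', '1-3' or '1,4-5' into a sorted list of module numbers.

    Hypothetical helper: the comma-separated form is an assumption.
    """
    modules: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid module range: {part!r}")
            modules.update(range(low, high + 1))
        else:
            modules.add(int(part))
    return sorted(modules)
```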

## Key Design Decisions

- **Platform split:** Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
- **Vulkan, not CUDA:** Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
- **Model:** `ggml-medium-q5_0.bin` (quantized medium, fits in 8GB VRAM). Stored in `models/`.
- **manifest.json** is the shared state between all scripts — tracks download/transcribe status per lecture. Checkpointed after each file.
- **Resumability:** All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
- **Environment variables:** `COURSE_USERNAME` and `COURSE_PASSWORD` in `.env`. Optional: `WHISPER_BIN`, `WHISPER_MODEL` to override paths.
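
The checkpoint-and-resume pattern described above can be sketched as follows. The manifest schema here (a `lectures` list whose entries carry a `title` and a per-step status of `"done"`) is hypothetical, as are the function names; the real `manifest.json` layout is not shown in this file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_manifest(path: Path) -> dict:
    """Load the manifest, or start an empty one if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"lectures": []}


def save_manifest(path: Path, manifest: dict) -> None:
    """Write to a temp file in the same directory, then replace the manifest in one step."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)


def process_pending(path: Path, step: str, work: Callable[[dict], None]) -> list[str]:
    """Run work() on every lecture not yet marked done for this step.

    The manifest is saved after each lecture, so an interrupted run
    resumes where it stopped. Returns the titles processed in this run.
    """
    manifest = load_manifest(path)
    processed: list[str] = []
    for lecture in manifest["lectures"]:
        if lecture.get(step) == "done":
            continue  # already completed in an earlier run
        work(lecture)
        lecture[step] = "done"
        save_manifest(path, manifest)
        processed.append(lecture["title"])
    return processed
```

Writing through a temporary file and `os.replace` keeps the manifest readable even if the process is killed mid-write, which matters when each transcription can take many minutes.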

## Dependencies

Python packages (in requirements.txt): `requests`, `beautifulsoup4`, `python-dotenv`

External tools (auto-installed by run.bat/setup_whisper.py):
- whisper.cpp (whisper-cli.exe) with Vulkan support
- ffmpeg (for MP3→WAV conversion)
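
For reference, the MP3→WAV step produces the 16 kHz mono input that whisper.cpp expects. The sketch below shows one way to build that ffmpeg call from Python; the helper names and the explicit `pcm_s16le` codec are assumptions, not necessarily what `transcribe.py` does.

```python
import subprocess
from pathlib import Path


def ffmpeg_wav_command(mp3: Path, wav: Path, ffmpeg: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg command that converts an MP3 to 16 kHz mono 16-bit PCM WAV."""
    return [
        ffmpeg,
        "-y",                 # overwrite the output file if it exists
        "-i", str(mp3),       # input file
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM (assumed codec choice)
        str(wav),
    ]


def convert_to_wav(mp3: Path, wav: Path, ffmpeg: str = "ffmpeg") -> None:
    """Run the conversion and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_wav_command(mp3, wav, ffmpeg), check=True)
```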

## Gstack Skills

For all web browsing tasks, use the `/browse` skill from gstack. Never use `mcp__claude-in-chrome__*` tools.

Available gstack skills:
`/office-hours`, `/plan-ceo-review`, `/plan-eng-review`, `/plan-design-review`, `/design-consultation`, `/design-shotgun`, `/design-html`, `/review`, `/ship`, `/land-and-deploy`, `/canary`, `/benchmark`, `/browse`, `/connect-chrome`, `/qa`, `/qa-only`, `/design-review`, `/setup-browser-cookies`, `/setup-deploy`, `/retro`, `/investigate`, `/document-release`, `/codex`, `/cso`, `/autoplan`, `/plan-devex-review`, `/devex-review`, `/careful`, `/freeze`, `/guard`, `/unfreeze`, `/gstack-upgrade`, `/learn`

## Summarization Documentation

All summarization processes are documented in:
- `PROCES_SUMARIZARE.md`: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET) and how to use parallel agents
- `PROMPT_EXPERIENTIAL.md`: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications) with complete prompts

Outputs in `summaries/`:
- 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
- 6 modules × 2 experiential files (MODUL{N}_{CONCEPT}.md + MODUL{N}_{CONCEPT}_APLICARI.md)
- 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
- `summaries/pdf/`: all generated with `md_to_pdf.py`