Marius Mutu 40f821f5df docs: update the process and prompts with what was actually used for M6
- PROCES_SUMARIZARE: 2-phase process with parallel agents + intermediate
  extraction (_m{N}_extract_*.md) + commercial-content filter + naming
  variant (M6+ uses "Master 2025 M{N}" vs M1-M5 "Master 25M{N}")
- PROMPT_EXPERIENTIAL: 4-step process (extraction → per day → merge →
  applications), new PROMPT_EXTRAGERE, 3-day variant for modules like M6,
  multi-theme concept (M6_ARHETIPURI_CREDO), detailed prompt structures
- CLAUDE.md: 5 modules → 6 (complete course) + "Documentația
  sumarizărilor" (summarization documentation) section with pointers
- TODOS.md: removed the M6 tasks (done), kept only the M1 sealed envelope
  + the 3-week post-Master moratorium + a checklist for the future

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:05:29 +03:00

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The complete course (M1-M6) was finished in April 2026.
## Architecture
Three-stage batch pipeline, all driven by a shared `manifest.json`:
1. **download.py** — Logs into the course site (credentials from `.env`), scrapes module/lecture structure, downloads MP3s to `audio/`. Updates manifest with download status. Resumable (skips existing files).
2. **transcribe.py** — Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (`whisper-cli.exe`) for Romanian speech-to-text. Outputs `.txt` and `.srt` to `transcripts/`. Has a quality gate after the first module. Resumable. Supports `--modules 1-3` filter.
3. **summarize.py** — Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). `--compile` flag assembles all summaries into `SUPORT_CURS.md` master study guide. Summaries are in Romanian.
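
The chunking step in stage 3 can be sketched as follows. This is a minimal illustration, not the code in `summarize.py`: the function name `chunk_text`, the `max_chars` and `overlap_sentences` parameters, and the regex-based sentence split are all assumptions made for the example.

```python
import re


def chunk_text(text: str, max_chars: int = 2000, overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks that always end on a sentence boundary.

    Every chunk after the first starts by repeating the last
    `overlap_sentences` sentences of the previous chunk.
    A single sentence longer than `max_chars` is kept whole, not split.
    """
    # Naive sentence split (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One. Two. Three. Four.", max_chars=10):
        print(chunk)
```

With `max_chars=10`, the four short sentences above come out as three chunks, each sharing one sentence with its neighbour. The overlap gives the summarizer some context across chunk boundaries at the cost of a little repeated text.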
**setup_whisper.py** — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by `run.bat`.
**run.bat** — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
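
A module filter such as `1-3` or `4-5` could be expanded into module numbers roughly like this. The helper name `parse_modules` is hypothetical, and support for comma-separated lists is an assumption; the documented forms are a single number and a single range.

```python
def parse_modules(spec: str) -> list[int]:
    """Expand a filter such as "1", "1-3" or "1,4-5" into sorted module numbers."""
    modules: set[int] = set()
    for part in spec.split(","):  # comma lists are an assumption, not documented
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid module range: {part}")
            modules.update(range(start, end + 1))
        else:
            modules.add(int(part))
    return sorted(modules)


if __name__ == "__main__":
    print(parse_modules("1-3"))
```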
## Commands
```bash
# Full pipeline (Windows native)
run.bat # download + transcribe all modules
run.bat 4-5 # transcribe only modules 4-5
# Individual steps (from venv)
python download.py # download audio files
python transcribe.py # transcribe all
python transcribe.py --modules 1 # transcribe module 1 only
python summarize.py # print summary prompts to stdout
python summarize.py --compile # compile SUPORT_CURS.md from existing summaries
# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3 # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md # specific file
# Setup components individually
python setup_whisper.py whisper # download whisper.cpp binary
python setup_whisper.py model # download Whisper model
python setup_whisper.py ffmpeg # download ffmpeg
```
## Key Design Decisions
- **Platform split:** Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
- **Vulkan, not CUDA:** Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
- **Model:** `ggml-medium-q5_0.bin` (quantized medium, fits in 8GB VRAM). Stored in `models/`.
- **manifest.json** is the shared state between all scripts — tracks download/transcribe status per lecture. Checkpointed after each file.
- **Resumability:** All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
- **Environment variables:** `COURSE_USERNAME` and `COURSE_PASSWORD` in `.env`. Optional: `WHISPER_BIN`, `WHISPER_MODEL` to override paths.
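
The checkpoint-and-skip pattern behind the manifest and resumability decisions can be sketched as below. The manifest layout used here (a top-level `lectures` list with a per-step status field set to `"done"`) is an assumption for illustration; the real `manifest.json` schema may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Read the manifest, or start with an empty one (hypothetical schema)."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"lectures": []}


def save_manifest(path: Path, manifest: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written manifest."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(manifest, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)


def process_pending(path: Path, step: str, work) -> int:
    """Run `work(lecture)` for every lecture not yet marked done for `step`.

    The manifest is saved after each lecture, so a re-run after a
    failure resumes from the first unfinished item.
    """
    manifest = load_manifest(path)
    processed = 0
    for lecture in manifest["lectures"]:
        if lecture.get(step) == "done":
            continue  # already completed in an earlier run
        work(lecture)
        lecture[step] = "done"
        save_manifest(path, manifest)  # checkpoint after each file
        processed += 1
    return processed
```

Saving after every lecture costs a small write per file, which is negligible next to the transcription time for a single lecture.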
## Dependencies
Python packages (in requirements.txt): `requests`, `beautifulsoup4`, `python-dotenv`
External tools (auto-installed by run.bat/setup_whisper.py):
- whisper.cpp (whisper-cli.exe) with Vulkan support
- ffmpeg (for MP3→WAV conversion)
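
For reference, an MP3→WAV conversion to 16 kHz mono can be expressed as an ffmpeg command built from Python, as in this sketch. The helper names are hypothetical, and the explicit `pcm_s16le` codec is an assumption (16-bit PCM is the usual whisper.cpp input); the exact arguments used by `transcribe.py` may differ.

```python
import subprocess
from pathlib import Path


def ffmpeg_wav_command(mp3: Path, wav: Path, ffmpeg: str = "ffmpeg") -> list[str]:
    """Build the ffmpeg argument list for a 16 kHz mono WAV conversion."""
    return [
        ffmpeg,
        "-y",                  # overwrite the output file if it exists
        "-i", str(mp3),        # input MP3
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit PCM (assumption, see lead-in)
        str(wav),
    ]


def convert(mp3: Path, wav: Path) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_wav_command(mp3, wav), check=True)
```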
## Summarization documentation
All summarization processes are documented in:
- `PROCES_SUMARIZARE.md`: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET) and how parallel agents are used
- `PROMPT_EXPERIENTIAL.md`: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications) with complete prompts
Outputs in `summaries/`:
- 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
- 6 modules × 2 experiential files (MODUL{N}_{CONCEPT}.md + MODUL{N}_{CONCEPT}_APLICARI.md)
- 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
- `summaries/pdf/`: all PDFs, generated with `md_to_pdf.py`
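
A quick completeness check for the 18 standard files might look like the sketch below. The helper names are hypothetical and no such script is part of the repository. Only the standard three-level files are covered, because the `{CONCEPT}` part of the other file names varies per module.

```python
from pathlib import Path

LEVELS = ("EXHAUSTIV", "CONCENTRAT", "CHEATSHEET")


def expected_standard_files(modules=range(1, 7)) -> list[str]:
    """List the standard summary file names for modules 1-6."""
    return [f"MODUL{n}_{level}.md" for n in modules for level in LEVELS]


def missing_standard_files(summaries_dir: Path) -> list[str]:
    """Return the expected standard files not present in `summaries_dir`."""
    return [name for name in expected_standard_files()
            if not (summaries_dir / name).exists()]


if __name__ == "__main__":
    for name in missing_standard_files(Path("summaries")):
        print("missing:", name)
```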