- PROCES_SUMARIZARE: two-phase process with parallel agents + intermediate
extraction (_m{N}_extract_*.md) + commercial-content filter + naming
variant (M6+ uses "Master 2025 M{N}" vs. M1-M5 "Master 25M{N}")
- PROMPT_EXPERIENTIAL: four-step process (extraction → per day → merge →
applications), new PROMPT_EXTRAGERE, 3-day variant for modules like M6,
multi-theme concept (M6_ARHETIPURI_CREDO), detailed prompt structures
- CLAUDE.md: 5 modules → 6 (complete course) + "Documentația
sumarizărilor" (summary documentation) section with pointers
- TODOS.md: removed the M6 tasks (done), kept only the M1 sealed envelope
+ the 3-week post-Master moratorium + a checklist for the future
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
76 lines
4.2 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The full course (M1-M6) was completed in April 2026.
## Architecture

Three-stage batch pipeline, all driven by a shared `manifest.json`:

1. **download.py**: Logs into the course site (credentials from `.env`), scrapes module/lecture structure, downloads MP3s to `audio/`. Updates manifest with download status. Resumable (skips existing files).

2. **transcribe.py**: Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (`whisper-cli.exe`) for Romanian speech-to-text. Outputs `.txt` and `.srt` to `transcripts/`. Has a quality gate after the first module. Resumable. Supports `--modules 1-3` filter.

3. **summarize.py**: Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). `--compile` flag assembles all summaries into `SUPORT_CURS.md` master study guide. Summaries are in Romanian.

**setup_whisper.py**: Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by `run.bat`.

**run.bat**: Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
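The chunking that `summarize.py` performs (splitting long transcripts at sentence boundaries with overlap) could look roughly like the sketch below. This is not the script's actual code: the function name `chunk_text`, the `max_chars` and `overlap_sentences` parameters, and the naive punctuation-based sentence splitter are all assumptions made for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 8000, overlap_sentences: int = 2) -> list[str]:
    """Split text into chunks of roughly max_chars, breaking only between sentences.

    Hypothetical helper: names and defaults are illustrative, not taken from summarize.py.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    length = 0
    for sentence in sentences:
        if current and length + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences over so context spans the chunk boundary.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            length = sum(len(s) + 1 for s in current)
        current.append(sentence)
        length += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Overlapping by whole sentences keeps each prompt readable on its own; a chunk can exceed `max_chars` when a single sentence (or the carried-over overlap) is already longer than the limit.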
## Commands

```bash
# Full pipeline (Windows native)
run.bat                                            # download + transcribe all modules
run.bat 4-5                                        # transcribe only modules 4-5

# Individual steps (from venv)
python download.py                                 # download audio files
python transcribe.py                               # transcribe all
python transcribe.py --modules 1                   # transcribe module 1 only
python summarize.py                                # print summary prompts to stdout
python summarize.py --compile                      # compile SUPORT_CURS.md from existing summaries

# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py                  # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3    # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md   # specific file

# Setup components individually
python setup_whisper.py whisper                    # download whisper.cpp binary
python setup_whisper.py model                      # download Whisper model
python setup_whisper.py ffmpeg                     # download ffmpeg
```
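The module filter values used above (`1`, `4-5`, `1-3`) are either a single module number or an inclusive range. A minimal sketch of how such a value could be parsed follows; `parse_modules` is a hypothetical helper, not a function confirmed to exist in the scripts, and it only handles the two forms shown in this file.

```python
def parse_modules(spec: str) -> list[int]:
    """Parse a filter such as "1" or "4-5" into a list of module numbers.

    Hypothetical helper; only the single-number and inclusive-range forms are supported.
    """
    spec = spec.strip()
    if "-" in spec:
        start_text, end_text = spec.split("-", 1)
        start, end = int(start_text), int(end_text)
        if start > end:
            raise ValueError(f"invalid module range: {spec!r}")
        return list(range(start, end + 1))
    return [int(spec)]
```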
## Key Design Decisions

- **Platform split:** Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
- **Vulkan, not CUDA:** Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
- **Model:** `ggml-medium-q5_0.bin` (quantized medium, fits in 8GB VRAM). Stored in `models/`.
- **manifest.json** is the shared state between all scripts: it tracks download/transcribe status per lecture. Checkpointed after each file.
- **Resumability:** All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
- **Environment variables:** `COURSE_USERNAME` and `COURSE_PASSWORD` in `.env`. Optional: `WHISPER_BIN`, `WHISPER_MODEL` to override paths.
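The manifest-driven resumability described in the list above can be illustrated with a short sketch. The manifest schema used here (a top-level `"lectures"` list whose entries carry a per-stage status string) and the function name `process_pending` are assumptions; the real `manifest.json` layout may differ.

```python
import json
from pathlib import Path
from typing import Callable


def process_pending(manifest_path: Path, stage: str, handler: Callable[[dict], None]) -> int:
    """Run handler on every lecture not yet complete for stage; return how many ran.

    Hypothetical schema: {"lectures": [{"id": ..., "<stage>": "complete"}, ...]}.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    processed = 0
    for lecture in manifest["lectures"]:
        if lecture.get(stage) == "complete":
            continue  # resumability: skip work finished in an earlier run
        handler(lecture)
        lecture[stage] = "complete"
        # Checkpoint after each file: write a temp file, then swap it into place.
        tmp_path = manifest_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
        tmp_path.replace(manifest_path)
        processed += 1
    return processed
```

Writing the checkpoint after every lecture means a crash loses at most the file in progress, and re-running the same call picks up where the previous run stopped.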
## Dependencies

Python packages (in requirements.txt): `requests`, `beautifulsoup4`, `python-dotenv`

External tools (auto-installed by run.bat/setup_whisper.py):
- whisper.cpp (whisper-cli.exe) with Vulkan support
- ffmpeg (for MP3→WAV conversion)
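For reference, the MP3→WAV conversion that ffmpeg is used for (16 kHz mono, as stated in the Architecture section) could be invoked from Python roughly as follows. The helper names are hypothetical, and the 16-bit PCM codec choice is an assumption based on the input format whisper.cpp usually expects; the flags themselves (`-y`, `-i`, `-ar`, `-ac`, `-c:a`) are standard ffmpeg options.

```python
import subprocess
from pathlib import Path


def ffmpeg_wav_command(mp3_path: Path, wav_path: Path, ffmpeg_bin: str = "ffmpeg") -> list[str]:
    """Build the ffmpeg argument list for a 16 kHz mono WAV (hypothetical helper)."""
    return [
        ffmpeg_bin,
        "-y",                 # overwrite an existing output file
        "-i", str(mp3_path),  # input MP3
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to mono
        "-c:a", "pcm_s16le",  # assumption: 16-bit PCM samples
        str(wav_path),
    ]


def convert_to_wav(mp3_path: Path, wav_path: Path) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_wav_command(mp3_path, wav_path), check=True)
```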
## Summary Documentation

All summarization processes are documented in:
- `PROCES_SUMARIZARE.md`: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET) plus how to use parallel agents
- `PROMPT_EXPERIENTIAL.md`: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications) with complete prompts

Outputs in `summaries/`:
- 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
- 6 modules × 2 experiential files (MODUL{N}_{CONCEPT}.md + MODUL{N}_{CONCEPT}_APLICARI.md)
- 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
- `summaries/pdf/`: PDFs, all generated with `md_to_pdf.py`
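As a quick completeness check on the standard outputs, the 6 × 3 naming pattern above can be enumerated and compared against the directory contents. This is only a sketch: the helper names are hypothetical, and it covers the standard files only, since the `{CONCEPT}` part of the experiential filenames differs per module.

```python
from pathlib import Path

LEVELS = ("EXHAUSTIV", "CONCENTRAT", "CHEATSHEET")


def expected_standard_files(module_count: int = 6) -> list[str]:
    """List the MODUL{N}_{LEVEL}.md names implied by the naming pattern."""
    return [f"MODUL{n}_{level}.md" for n in range(1, module_count + 1) for level in LEVELS]


def missing_standard_files(summaries_dir: Path, module_count: int = 6) -> list[str]:
    """Return the expected standard files that are not present in summaries_dir."""
    return [
        name
        for name in expected_standard_files(module_count)
        if not (summaries_dir / name).is_file()
    ]
```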