Marius Mutu 40f821f5df docs: update the process and prompts with what was actually used for M6
- PROCES_SUMARIZARE: two-phase process with parallel agents + intermediate
  extraction (_m{N}_extract_*.md) + commercial-content filter + naming
  variant (M6+ uses "Master 2025 M{N}" vs M1-M5 "Master 25M{N}")
- PROMPT_EXPERIENTIAL: four-step process (extraction → per day → merge →
  applications), new PROMPT_EXTRAGERE, 3-day variant for modules like M6,
  multi-theme concept (M6_ARHETIPURI_CREDO), detailed prompt structures
- CLAUDE.md: 5 modules → 6 (complete course) + "Documentația
  sumarizărilor" (summarization documentation) section with pointers
- TODOS.md: removed the M6 tasks (done), kept only the M1 sealed envelope
  + the 3-week post-Master moratorium + a checklist for the future

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:05:29 +03:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~44 MP3 files (~70 hours total) across 6 modules. The complete course (M1-M6) was finalized in April 2026.

Architecture

Three-stage batch pipeline, all driven by a shared manifest.json:

  1. download.py — Logs into the course site (credentials from .env), scrapes module/lecture structure, downloads MP3s to audio/. Updates manifest with download status. Resumable (skips existing files).

  2. transcribe.py — Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (whisper-cli.exe) for Romanian speech-to-text. Outputs .txt and .srt to transcripts/. Has a quality gate after the first module. Resumable. Supports --modules 1-3 filter.

  3. summarize.py — Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). --compile flag assembles all summaries into SUPORT_CURS.md master study guide. Summaries are in Romanian.
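
The chunking that summarize.py is described as doing (splitting long texts at sentence boundaries with overlap) could look roughly like the sketch below. It is an illustration only, not the script's actual code: the function name `chunk_text`, the `max_chars` and `overlap_sentences` parameters, and the regex used to detect sentence ends are all assumptions.

```python
import re


def chunk_text(text: str, max_chars: int = 200, overlap_sentences: int = 1) -> list[str]:
    """Split text into chunks that end on sentence boundaries.

    The last `overlap_sentences` sentences of each chunk are repeated at the
    start of the next one, so context carries across chunk borders.
    A chunk may exceed `max_chars` when a single sentence (plus overlap) is long.
    """
    # Naive sentence splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk and carry its tail forward as overlap.
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One. Two. Three. Four.", max_chars=10):
        print(chunk)
```

With `max_chars=10` and one sentence of overlap, the demo text yields three chunks, each starting with the last sentence of the previous one. A real transcript would use a much larger limit.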

setup_whisper.py — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by run.bat.

run.bat — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
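
A module filter such as `1-3`, `4-5`, or `1` can be expanded into a list of module numbers with a few lines of Python. The sketch below is a guess at the general shape, not the parser used by transcribe.py or run.bat; the name `parse_modules` is hypothetical, and support for comma-separated parts is an added assumption that the source does not mention.

```python
def parse_modules(spec: str) -> list[int]:
    """Expand a filter like "1-3" or "4-5" into a sorted list of module numbers.

    Comma-separated parts ("1,4-5") are accepted as well; that is an
    assumption of this sketch, not documented behaviour of the scripts.
    """
    modules: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid module range: {part!r}")
            modules.update(range(start, end + 1))
        else:
            modules.add(int(part))
    return sorted(modules)


if __name__ == "__main__":
    print(parse_modules("1-3"))
    print(parse_modules("4-5"))
```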

Commands

# Full pipeline (Windows native)
run.bat              # download + transcribe all modules
run.bat 4-5          # transcribe only modules 4-5

# Individual steps (from venv)
python download.py                # download audio files
python transcribe.py              # transcribe all
python transcribe.py --modules 1  # transcribe module 1 only
python summarize.py               # print summary prompts to stdout
python summarize.py --compile     # compile SUPORT_CURS.md from existing summaries

# MD → PDF (from WSL2, uses .venv_pdf)
.venv_pdf/bin/python md_to_pdf.py                    # all MODUL*_*.md → summaries/pdf/
.venv_pdf/bin/python md_to_pdf.py --modules 1-3      # specific modules
.venv_pdf/bin/python md_to_pdf.py summaries/X.md     # specific file

# Setup components individually
python setup_whisper.py whisper   # download whisper.cpp binary
python setup_whisper.py model     # download Whisper model
python setup_whisper.py ffmpeg    # download ffmpeg

Key Design Decisions

  • Platform split: Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
  • Vulkan, not CUDA: Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
  • Model: ggml-medium-q5_0.bin (quantized medium, fits in 8GB VRAM). Stored in models/.
  • manifest.json is the shared state between all scripts — tracks download/transcribe status per lecture. Checkpointed after each file.
  • Resumability: All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
  • Environment variables: COURSE_USERNAME and COURSE_PASSWORD in .env. Optional: WHISPER_BIN, WHISPER_MODEL to override paths.
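
To make the checkpoint-and-resume idea concrete, here is a minimal sketch of a loop that skips completed lectures and rewrites the manifest after each file. The manifest layout used here (a `lectures` list whose entries have `id` and `transcribed` keys) and the helper names `save_manifest` and `process_pending` are assumptions for illustration; the real manifest.json schema is not shown in this document.

```python
import json
import os
import tempfile
from pathlib import Path


def save_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest atomically: temp file first, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # atomic replace on the same filesystem


def process_pending(manifest: dict, path: Path, work) -> list[str]:
    """Run `work` on every lecture not yet done, checkpointing after each one."""
    done_now: list[str] = []
    for lecture in manifest["lectures"]:  # hypothetical schema
        if lecture.get("transcribed"):
            continue  # already completed in an earlier run
        work(lecture)
        lecture["transcribed"] = True
        save_manifest(manifest, path)  # checkpoint after each file
        done_now.append(lecture["id"])
    return done_now


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        manifest_path = Path(tmp_dir) / "manifest.json"
        state = {"lectures": [{"id": "a", "transcribed": True}, {"id": "b"}]}
        print(process_pending(state, manifest_path, lambda lec: None))
```

Writing to a temporary file and renaming it means an interrupted run leaves either the old manifest or the new one on disk, never a half-written file, which is what makes re-running after a failure safe in this sketch.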

Dependencies

Python packages (in requirements.txt): requests, beautifulsoup4, python-dotenv

External tools (auto-installed by run.bat/setup_whisper.py):

  • whisper.cpp (whisper-cli.exe) with Vulkan support
  • ffmpeg (for MP3→WAV conversion)
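
For reference, the MP3→WAV step (16kHz mono) can be expressed as an ffmpeg command built in Python. This is a sketch of one plausible invocation, not the command transcribe.py actually runs: the helper name `build_ffmpeg_cmd`, the example file paths, and the choice of 16-bit PCM output are assumptions. The flags themselves (`-y`, `-i`, `-ar`, `-ac`, `-c:a`) are standard ffmpeg options.

```python
from pathlib import Path


def build_ffmpeg_cmd(mp3_path: Path, wav_path: Path, ffmpeg_bin: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg command that converts an MP3 to 16 kHz mono WAV."""
    return [
        ffmpeg_bin,
        "-y",                 # overwrite the output file if it already exists
        "-i", str(mp3_path),  # input file
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # downmix to a single (mono) channel
        "-c:a", "pcm_s16le",  # 16-bit PCM output (assumption)
        str(wav_path),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_ffmpeg_cmd(Path("audio/lecture.mp3"), Path("audio/lecture.wav"))
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it as a list rather than a string avoids shell-quoting problems with paths that contain spaces.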

Summarization documentation

All summarization processes are documented in:

  • PROCES_SUMARIZARE.md: the standard process (3 levels per module: EXHAUSTIV / CONCENTRAT / CHEATSHEET) + how parallel agents are used
  • PROMPT_EXPERIENTIAL.md: the facilitator's experiential workbook (4 steps: extraction → per day → merge → applications) with complete prompts

Outputs in summaries/:

  • 6 modules × 3 standard files (MODUL{N}_EXHAUSTIV/CONCENTRAT/CHEATSHEET.md)
  • 6 modules × 2 experiential files (MODUL{N}{CONCEPT}.md + MODUL{N}{CONCEPT}_APLICARI.md)
  • 5 cross-module files: INDEX_EXERCITII, HARTA_CONEXIUNI, GLOSAR_CREDINTE, GHID_FACILITARE, METAFORE_POVESTI
  • summaries/pdf/: all generated with md_to_pdf.py