Remove manifest.json from git tracking (machine-specific state)

Add manifest.json to .gitignore so it doesn't conflict between machines. Also add CLAUDE.md for Claude Code guidance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 02:02:47 +02:00
parent 45e72bc89b
commit 56e676618f
3 changed files with 95 additions and 567 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,58 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+NLP Master is a personal audio-to-text pipeline that downloads, transcribes, and summarizes Romanian NLP course lectures from cursuri.aresens.ro/curs/26. It processes ~35 MP3 files (~58 hours total) across 5 modules (6th pending).
+
+## Architecture
+
+Three-stage batch pipeline, all driven by a shared `manifest.json`:
+
+1. **download.py** — Logs into the course site (credentials from `.env`), scrapes module/lecture structure, downloads MP3s to `audio/`. Updates manifest with download status. Resumable (skips existing files).
+
+2. **transcribe.py** — Reads manifest, converts MP3→WAV (16kHz mono via ffmpeg), runs whisper.cpp (`whisper-cli.exe`) for Romanian speech-to-text. Outputs `.txt` and `.srt` to `transcripts/`. Has a quality gate after the first module. Resumable. Supports `--modules 1-3` filter.
+
+3. **summarize.py** — Generates Claude-compatible prompts from transcripts (chunks long texts at sentence boundaries with overlap). The `--compile` flag assembles all summaries into the `SUPORT_CURS.md` master study guide. Summaries are in Romanian.
+
+**setup_whisper.py** — Auto-downloads whisper.cpp (Vulkan build for AMD GPU), ffmpeg, and the Whisper model. Called by `run.bat`.
+
+**run.bat** — Windows batch orchestrator: checks prerequisites, auto-installs missing components, creates venv, runs download→transcribe pipeline. Accepts optional module filter argument.
+
+## Commands
+
+```bash
+# Full pipeline (Windows native)
+run.bat              # download + transcribe all modules
+run.bat 4-5          # transcribe only modules 4-5
+
+# Individual steps (from venv)
+python download.py                # download audio files
+python transcribe.py              # transcribe all
+python transcribe.py --modules 1  # transcribe module 1 only
+python summarize.py               # print summary prompts to stdout
+python summarize.py --compile     # compile SUPORT_CURS.md from existing summaries
+
+# Setup components individually
+python setup_whisper.py whisper   # download whisper.cpp binary
+python setup_whisper.py model     # download Whisper model
+python setup_whisper.py ffmpeg    # download ffmpeg
+```
+
+## Key Design Decisions
+
+- **Platform split:** Scripts run on native Windows (whisper.cpp needs Vulkan GPU access). Claude Code runs from WSL2 for summaries.
+- **Vulkan, not CUDA:** Hardware is AMD Radeon RX 6600M 8GB (RDNA2). whisper.cpp is built with Vulkan backend.
+- **Model:** `ggml-medium-q5_0.bin` (quantized medium, fits in 8GB VRAM). Stored in `models/`.
+- **manifest.json** is the shared state between all scripts — tracks download/transcribe status per lecture. Checkpointed after each file.
+- **Resumability:** All scripts skip already-completed files. Safe to re-run after failures or when new modules appear.
+- **Environment variables:** `COURSE_USERNAME` and `COURSE_PASSWORD` in `.env`. Optional: `WHISPER_BIN`, `WHISPER_MODEL` to override paths.
+
+## Dependencies
+
+Python packages (in requirements.txt): `requests`, `beautifulsoup4`, `python-dotenv`
+
+External tools (auto-installed by run.bat/setup_whisper.py):
+- whisper.cpp (whisper-cli.exe) with Vulkan support
+- ffmpeg (for MP3→WAV conversion)
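Note on the manifest behaviour described in CLAUDE.md above: the file says `manifest.json` is checkpointed after each file and that all scripts skip completed work. The commit does not show how the scripts do this, so the following is only a minimal sketch of that pattern. The manifest layout (`lectures`, `status`) and the helper names `load_manifest`, `save_manifest`, and `process_pending` are hypothetical, not taken from the repository.

```python
import json
import os
import tempfile
from pathlib import Path


def load_manifest(path: Path) -> dict:
    """Return the manifest, or an empty one if the file does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"lectures": {}}  # hypothetical layout, not the project's real schema


def save_manifest(path: Path, manifest: dict) -> None:
    """Write to a temp file in the same directory, then replace the target.

    os.replace swaps the file in one step, so an interrupted run should not
    leave a half-written manifest behind.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(manifest, tmp, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)


def process_pending(path: Path, lecture_ids, work) -> list:
    """Call work(lecture_id) for every lecture not marked done.

    The manifest is saved after each lecture, so a re-run resumes where the
    previous run stopped. Returns the ids processed in this run.
    """
    manifest = load_manifest(path)
    processed = []
    for lecture_id in lecture_ids:
        entry = manifest["lectures"].setdefault(lecture_id, {})
        if entry.get("status") == "done":
            continue  # already completed in an earlier run
        work(lecture_id)
        entry["status"] = "done"
        save_manifest(path, manifest)  # checkpoint after each file
        processed.append(lecture_id)
    return processed
```

Because a file like this records which lectures have been processed on one particular machine, it behaves as local state, which is consistent with this commit moving `manifest.json` into `.gitignore`.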