feat: fourth lesson type, PDF (text extracted with pypdf)

Recon on practitioner M1 shows that some lessons have neither audio nor a Vimeo iframe, only a "Descarcă rezumat PDF" ("Download PDF summary") link (/resurse/*.pdf). The old scraper classified them as "text" and marked them failed (the HTML body had fewer than 50 chars).

- classify_lesson: now detects a[href$=".pdf"] → type="pdf".
- download_pdf_and_extract: downloads the PDF via the authenticated session, reads it with pypdf, writes a .txt transcript with a header plus per-page content, then deletes the source PDF (user preference: sources are not kept).
- Branch in the main loop for type=="pdf".
- requirements.txt: + pypdf.
- transcribe.py: skip type in ("text", "pdf"); the transcript is already written by download.py.

Limitations: PDFs with mostly visual content (infographics, diagrams) yield little text. The title and inline text are captured; the rest is left for manual review.

Tested on 4 M1 practitioner PDFs (Premisele NLP, Forme de Pacing, Gesturi de calmare, Exercitiu Pacing): 3/4 extracted well (877-3068 bytes), 1/4 is predominantly graphic content (203 bytes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
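
For illustration only, the detection rule could look roughly like the sketch below. It is not the code from download.py: the function name `classify_lesson_sketch`, the `"audio"` and `"video"` type names, the precedence order, and the sample file path in the usage line are assumptions. Only the `a[href$=".pdf"]` rule and the `"text"` / `"pdf"` types come from this commit.

```python
from html.parser import HTMLParser


class _LessonScan(HTMLParser):
    """Collects the few signals needed to guess a lesson type."""

    def __init__(self):
        super().__init__()
        self.has_audio = False
        self.has_vimeo = False
        self.pdf_href = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "audio":
            self.has_audio = True
        elif tag == "iframe" and "vimeo" in (attr.get("src") or ""):
            self.has_vimeo = True
        elif tag == "a" and self.pdf_href is None:
            href = attr.get("href") or ""
            # Same test as the CSS selector a[href$=".pdf"]: a plain suffix match.
            if href.endswith(".pdf"):
                self.pdf_href = href


def classify_lesson_sketch(html):
    """Return (type, pdf_href). Precedence (audio, video, pdf, text) is assumed."""
    scan = _LessonScan()
    scan.feed(html)
    if scan.has_audio:
        return "audio", None
    if scan.has_vimeo:
        return "video", None
    if scan.pdf_href:
        return "pdf", scan.pdf_href
    return "text", None
```

With this sketch, `classify_lesson_sketch('<a href="/resurse/premise.pdf">Descarcă rezumat PDF</a>')` returns `("pdf", "/resurse/premise.pdf")`, so a page that only carries the PDF link no longer falls through to `"text"`.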
2026-04-22 23:01:09 +03:00
parent a7cb06ac3e
commit 2e4bb88624
3 changed files with 118 additions and 5 deletions
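
As a rough sketch of the extraction step described in the commit message (not the actual download_pdf_and_extract), the flow is: fetch the PDF with the authenticated session, read it page by page with pypdf, and write a text transcript with a header. The helper names, the header layout, and the `--- page N ---` separators are assumptions, and `session` is assumed to behave like `requests.Session`. Unlike the commit, this variant parses the PDF from memory instead of saving it and deleting it afterwards.

```python
import io
from pathlib import Path


def format_transcript(title, page_texts):
    """Build the transcript text: a short header, then one section per page."""
    lines = [f"# {title}", f"# source: pdf, pages: {len(page_texts)}", ""]
    for number, text in enumerate(page_texts, start=1):
        lines.append(f"--- page {number} ---")
        lines.append((text or "").strip())
        lines.append("")
    return "\n".join(lines)


def pdf_to_transcript_sketch(session, url, title, out_path):
    """Download a PDF and write a .txt transcript; returns the size in bytes.

    `session` is assumed to be requests.Session-like (get, raise_for_status,
    content). The PDF never touches disk, so there is no source to delete.
    """
    from pypdf import PdfReader  # lazy import: only needed when a PDF is parsed

    resp = session.get(url, timeout=60)
    resp.raise_for_status()
    reader = PdfReader(io.BytesIO(resp.content))
    # Pages that are mostly graphics may yield little or no text.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    transcript = format_transcript(title, page_texts)
    Path(out_path).write_text(transcript, encoding="utf-8")
    return len(transcript.encode("utf-8"))
```

Returning the byte count makes it easy to flag near-empty transcripts, such as the 203-byte case mentioned above, for manual review.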
--- a/transcribe.py
+++ b/transcribe.py
@@ -211,11 +211,11 @@ def main():
        for lec in mod["lectures"]:
            total += 1

-            # Text lectures bypass whisper — transcript written by download.py.
-            if lec.get("type") == "text":
+            # Text and PDF lectures bypass whisper — transcript written by download.py.
+            if lec.get("type") in ("text", "pdf"):
                lec["transcribe_status"] = "complete"
                skipped += 1
-                log.info(f"  Skipping text: {lec['title']}")
+                log.info(f"  Skipping {lec.get('type')}: {lec['title']}")
                continue

            if lec.get("download_status") != "complete":