# DETAILED IMPLEMENTATION PLAN FOR STRATEGY S8 - HYBRID CLAUDE + SCRIPTS

## Indexing Scout Activities and Games

### CONTEXT AND OBJECTIVES

- **Project:** INDEX-SISTEM-JOCURI v2.0
- **Current state:** 63 activities indexed from INDEX_MASTER.md
- **Target:** 2000+ activities from 2086 files of mixed formats
- **Strategy:** S8 Hybrid - Python scripts for 90% of the volume + Claude for the 10% of high-value files
- **Estimated total time:** 8 hours (can be split across several sessions)
- **Budget:** $0 (using only the existing Claude Code)

### FILE DISTRIBUTION (VERIFIED)
```
1876 HTML files (89.9%) - automated processing with BeautifulSoup
 122 PDF files  (5.8%)  - Claude processing (high-value, high activity density)
  29 DOC files  (1.4%)  - Claude processing
  14 DOCX files (0.7%)  - semi-automated processing with python-docx
  35 TXT files  (1.7%)  - simple automated processing
  10 MD files   (0.5%)  - simple automated processing
```
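
These counts can be re-verified at any time with a short script; the sketch below is a standalone version of the same logic that `count_files_to_process` in the unified processor uses later, assuming the base path used throughout this plan:

```python
# Sketch: recount the corpus by extension (same base path as the plan).
from pathlib import Path

base = Path("/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri")
extensions = [".html", ".htm", ".pdf", ".doc", ".docx", ".txt", ".md"]

counts = {ext: sum(1 for _ in base.rglob(f"*{ext}")) for ext in extensions}
total = sum(counts.values()) or 1
for ext, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{n:5d} {ext} files ({n / total:.1%})")
```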

### EXISTING DATABASE STRUCTURE

```sql
-- The activities table with all required fields
id, name, description, rules, variations, category, subcategory,
source_file, page_reference, age_group_min, age_group_max,
participants_min, participants_max, duration_min, duration_max,
materials_category, materials_list, skills_developed,
difficulty_level, keywords, tags, popularity_score
```
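
The column list implies roughly the following DDL. This is a sketch reconstructed from the field names above: the types, constraints, and `created_at` default are assumptions, and the existing `data/activities.db` remains the source of truth.

```python
# Sketch only: a plausible DDL for the activities table. Types and the
# created_at default are assumed, not taken from the live database.
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS activities (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL,
    description TEXT, rules TEXT, variations TEXT,
    category TEXT, subcategory TEXT,
    source_file TEXT, page_reference TEXT,
    age_group_min INTEGER, age_group_max INTEGER,
    participants_min INTEGER, participants_max INTEGER,
    duration_min INTEGER, duration_max INTEGER,
    materials_category TEXT, materials_list TEXT,
    skills_developed TEXT, difficulty_level TEXT,
    keywords TEXT, tags TEXT, popularity_score INTEGER,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""

if __name__ == "__main__":
    with sqlite3.connect("data/activities.db") as conn:
        conn.executescript(DDL)
```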

---

## PHASE 1: SETUP AND PREPARATION (30 minutes)

### Step 1.1: Check and install dependencies

```bash
# Claude Code should run:
cd /mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI

# Check which of the required Python packages are already installed
# (-i because pip lists the package as "PyPDF2")
pip list | grep -iE "beautifulsoup4|pypdf2|python-docx|lxml"

# Install the missing packages
pip install beautifulsoup4 lxml pypdf2 python-docx chardet

# Create a directory for the new scripts
mkdir -p scripts/extractors
```

### Step 1.2: Create the file structure

```bash
# Claude Code should create the following files:
touch scripts/extractors/__init__.py
touch scripts/extractors/html_extractor.py
touch scripts/extractors/text_extractor.py
touch scripts/extractors/pdf_extractor.py
touch scripts/extractors/unified_processor.py
touch scripts/run_extraction.py
```

### Step 1.3: Back up the database

```bash
# IMPORTANT: back up before any processing
cp data/activities.db data/activities_backup_$(date +%Y%m%d_%H%M%S).db
```

---

## PHASE 2: BUILDING THE AUTOMATED EXTRACTORS (3 hours)

### Step 2.1: HTML Extractor (the most important - 1876 files)

**Claude Code should create `/scripts/extractors/html_extractor.py`:**

```python
#!/usr/bin/env python3
"""
HTML Activity Extractor - processes 1876 HTML files.
Extracts activities automatically using pattern recognition.
"""

import os
import re
import json
from pathlib import Path
from bs4 import BeautifulSoup
import chardet
from typing import List, Dict, Optional
import sqlite3
from datetime import datetime


class HTMLActivityExtractor:
    def __init__(self, db_path='data/activities.db'):
        self.db_path = db_path
        # Patterns for detecting activities in Romanian text
        # (the Romanian marker strings below are matching data, so they stay Romanian)
        self.activity_patterns = {
            'title_patterns': [
                r'(?i)(joc|activitate|exerci[țt]iu|team[\s-]?building|energizer|ice[\s-]?breaker)[\s:]+([^\.]{5,100})',
                r'(?i)<h[1-6][^>]*>([^<]*(?:joc|activitate|exerci[țt]iu)[^<]*)</h[1-6]>',
                r'(?i)<strong>([^<]*(?:joc|activitate|exerci[țt]iu)[^<]*)</strong>',
                r'(?i)^[\d]+\.?\s*([A-Z][^\.]{10,100}(?:joc|activitate|exerci[țt]iu)[^\.]{0,50})$',
            ],
            'description_markers': [
                'descriere', 'reguli', 'cum se joac[ăa]', 'instructiuni',
                'obiectiv', 'desfasurare', 'explicatie', 'mod de joc'
            ],
            'materials_markers': [
                'materiale', 'necesare', 'echipament', 'ce avem nevoie',
                'se folosesc', 'trebuie sa avem', 'dotari'
            ],
            'age_patterns': [
                r'(?i)v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
                r'(?i)(\d+)[\s-]+(\d+)\s*ani',
                r'(?i)pentru\s+(\d+)[\s-]+(\d+)\s*ani',
                r'(?i)categoria?\s*(?:de\s*)?v[âa]rst[ăa][\s:]+(\d+)[\s-]+(\d+)',
            ],
            'participants_patterns': [
                r'(?i)(\d+)[\s-]+(\d+)\s*(?:participan[țt]i|juc[ăa]tori|persoane|copii)',
                r'(?i)num[ăa]r\s*(?:de\s*)?(?:participan[țt]i|juc[ăa]tori)[\s:]+(\d+)[\s-]+(\d+)',
                r'(?i)grup\s*de\s*(\d+)[\s-]+(\d+)',
            ],
            'duration_patterns': [
                r'(?i)durat[ăa][\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
                r'(?i)timp[\s:]+(\d+)[\s-]+(\d+)\s*(?:minute|min)',
                r'(?i)(\d+)[\s-]+(\d+)\s*minute',
            ]
        }

        # Predefined categories based on the existing system
        self.categories = {
            '[A]': ['joc', 'joaca', 'distractie', 'amuzament'],
            '[B]': ['aventura', 'explorare', 'descoperire'],
            '[C]': ['camping', 'tabara', 'excursie', 'drumetie'],
            '[D]': ['foc', 'flacara', 'lumina'],
            '[E]': ['noduri', 'frânghii', 'sfori', 'legare'],
            '[F]': ['bushcraft', 'supravietuire', 'survival'],
            '[G]': ['educatie', 'educativ', 'invatare', 'scoala'],
            '[H]': ['orientare', 'busola', 'harta', 'navigare']
        }

    def detect_encoding(self, file_path):
        """Detect the file's character encoding."""
        with open(file_path, 'rb') as f:
            result = chardet.detect(f.read())
        return result['encoding'] or 'utf-8'

    def extract_from_html(self, html_path: str) -> List[Dict]:
        """Extract activities from a single HTML file."""
        activities = []

        try:
            # Detect the encoding and read the file
            encoding = self.detect_encoding(html_path)
            with open(html_path, 'r', encoding=encoding, errors='ignore') as f:
                content = f.read()

            soup = BeautifulSoup(content, 'lxml')

            # Method 1: look for activity lists
            activities.extend(self._extract_from_lists(soup, html_path))

            # Method 2: look for activities under headings
            activities.extend(self._extract_from_headings(soup, html_path))

            # Method 3: look for text patterns
            activities.extend(self._extract_from_patterns(soup, html_path))

            # Method 4: look inside tables
            activities.extend(self._extract_from_tables(soup, html_path))

        except Exception as e:
            print(f"Error processing {html_path}: {e}")

        return activities

    def _extract_from_lists(self, soup, source_file):
        """Extract activities from HTML lists (ul, ol)."""
        activities = []

        for list_elem in soup.find_all(['ul', 'ol']):
            # Check whether the list appears to contain activities
            list_text = list_elem.get_text().lower()
            if any(marker in list_text for marker in ['joc', 'activitate', 'exercitiu']):
                for li in list_elem.find_all('li'):
                    text = li.get_text(strip=True)
                    if len(text) > 20:  # At least 20 characters for a valid activity
                        activity = self._create_activity_from_text(text, source_file)
                        if activity:
                            activities.append(activity)

        return activities

    def _extract_from_headings(self, soup, source_file):
        """Extract activities based on headings."""
        activities = []

        for heading in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
            heading_text = heading.get_text(strip=True)

            # Check whether the heading contains one of the keywords
            if any(keyword in heading_text.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
                # Collect the description from the following sibling elements
                description = ""
                next_elem = heading.find_next_sibling()

                while next_elem and next_elem.name not in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                    if next_elem.name in ['p', 'div', 'ul']:
                        description += next_elem.get_text(strip=True) + " "
                        if len(description) > 500:  # Cap the description length
                            break
                    next_elem = next_elem.find_next_sibling()

                if description:
                    activity = {
                        'name': heading_text[:200],
                        'description': description[:1000],
                        'source_file': str(source_file),
                        'category': self._detect_category(heading_text + " " + description)
                    }
                    activities.append(activity)

        return activities

    def _extract_from_patterns(self, soup, source_file):
        """Extract activities using pattern matching."""
        activities = []
        text = soup.get_text()

        # Search the plain text for activity title patterns
        for pattern in self.activity_patterns['title_patterns']:
            matches = re.finditer(pattern, text, re.MULTILINE)
            for match in matches:
                # Use the last capture group if one exists, otherwise the whole match
                title = match.group(match.lastindex) if match.lastindex else match.group(0)
                if len(title) > 10:
                    # Grab the context around the match
                    start = max(0, match.start() - 200)
                    end = min(len(text), match.end() + 500)
                    context = text[start:end]

                    activity = self._create_activity_from_text(context, source_file, title)
                    if activity:
                        activities.append(activity)

        return activities

    def _extract_from_tables(self, soup, source_file):
        """Extract activities from tables."""
        activities = []

        for table in soup.find_all('table'):
            rows = table.find_all('tr')
            if len(rows) > 1:  # At least a header row and one data row
                # Detect the relevant columns
                headers = [th.get_text(strip=True).lower() for th in rows[0].find_all(['th', 'td'])]

                for row in rows[1:]:
                    cells = row.find_all(['td'])
                    if cells:
                        activity_data = {}
                        for i, cell in enumerate(cells):
                            if i < len(headers):
                                activity_data[headers[i]] = cell.get_text(strip=True)

                        # Build an activity from the table data
                        if any(key in activity_data for key in ['joc', 'activitate', 'nume', 'titlu']):
                            activity = self._create_activity_from_table_data(activity_data, source_file)
                            if activity:
                                activities.append(activity)

        return activities

    def _create_activity_from_text(self, text, source_file, title=None):
        """Build an activity dict from raw text."""
        if not text or len(text) < 30:
            return None

        activity = {
            'name': title or text[:100].split('.')[0].strip(),
            'description': text[:1000],
            'source_file': str(source_file),
            'category': self._detect_category(text),
            'keywords': self._extract_keywords(text),
            'created_at': datetime.now().isoformat()
        }

        # Extract additional metadata
        activity.update(self._extract_metadata(text))

        return activity

    def _create_activity_from_table_data(self, data, source_file):
        """Build an activity from table data."""
        activity = {
            'source_file': str(source_file),
            'created_at': datetime.now().isoformat()
        }

        # Map table columns to DB fields (values are stored as raw strings)
        field_mapping = {
            'nume': 'name', 'titlu': 'name', 'joc': 'name', 'activitate': 'name',
            'descriere': 'description', 'detalii': 'description', 'explicatie': 'description',
            'materiale': 'materials_list', 'echipament': 'materials_list',
            'varsta': 'age_group_min', 'categoria': 'category',
            'participanti': 'participants_min', 'numar': 'participants_min',
            'durata': 'duration_min', 'timp': 'duration_min'
        }

        for table_field, db_field in field_mapping.items():
            if table_field in data:
                activity[db_field] = data[table_field]

        # Minimal validation
        if 'name' in activity and len(activity.get('name', '')) > 5:
            return activity

        return None

    def _extract_metadata(self, text):
        """Extract metadata from text using the configured patterns."""
        metadata = {}

        # Age range
        for pattern in self.activity_patterns['age_patterns']:
            match = re.search(pattern, text)
            if match:
                metadata['age_group_min'] = int(match.group(1))
                metadata['age_group_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
                break

        # Number of participants
        for pattern in self.activity_patterns['participants_patterns']:
            match = re.search(pattern, text)
            if match:
                metadata['participants_min'] = int(match.group(1))
                metadata['participants_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
                break

        # Duration
        for pattern in self.activity_patterns['duration_patterns']:
            match = re.search(pattern, text)
            if match:
                metadata['duration_min'] = int(match.group(1))
                metadata['duration_max'] = int(match.group(2)) if match.lastindex >= 2 else int(match.group(1))
                break

        # Materials
        materials = []
        text_lower = text.lower()
        for marker in self.activity_patterns['materials_markers']:
            idx = text_lower.find(marker)
            if idx != -1:
                # Take the 200 characters after the marker
                materials_text = text[idx:idx+200]
                # Pull items out of a bulleted list
                # (hyphen escaped inside the class so it is not read as a range)
                items = re.findall(r'[-•]\s*([^\n\-•]+)', materials_text)
                if items:
                    materials.extend(items)

        if materials:
            metadata['materials_list'] = ', '.join(materials[:10])  # At most 10 materials

        return metadata

    def _detect_category(self, text):
        """Detect the activity category from keywords."""
        text_lower = text.lower()

        for category, keywords in self.categories.items():
            if any(keyword in text_lower for keyword in keywords):
                return category

        return '[A]'  # Default: games category

    def _extract_keywords(self, text):
        """Extract keywords from text."""
        keywords = []
        text_lower = text.lower()

        # Relevant keywords (Romanian, matching the source material)
        keyword_list = [
            'cooperare', 'competitie', 'echipa', 'creativitate', 'miscare',
            'strategie', 'comunicare', 'incredere', 'coordonare', 'atentie',
            'reflexe', 'logica', 'imaginatie', 'muzica', 'dans', 'sport',
            'natura', 'mediu', 'stiinta', 'matematica', 'limba', 'cultura'
        ]

        for keyword in keyword_list:
            if keyword in text_lower:
                keywords.append(keyword)

        return ', '.join(keywords[:5])  # At most 5 keywords

    def save_to_database(self, activities):
        """Save activities to the database."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        saved_count = 0
        duplicate_count = 0

        for activity in activities:
            try:
                # Skip duplicates (same name from the same source file)
                cursor.execute(
                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
                    (activity.get('name'), activity.get('source_file'))
                )

                if cursor.fetchone():
                    duplicate_count += 1
                    continue

                # Prepare the insert
                columns = []
                values = []
                placeholders = []

                for key, value in activity.items():
                    if key != 'created_at':  # Skip created_at, the column has a default
                        columns.append(key)
                        values.append(value)
                        placeholders.append('?')

                # Insert into the DB
                query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
                cursor.execute(query, values)
                saved_count += 1

            except Exception as e:
                print(f"Error saving activity: {e}")
                continue

        conn.commit()
        conn.close()

        return saved_count, duplicate_count

    def process_all_html_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
        """Process every HTML file under the given directory."""
        base_path = Path(base_path)
        html_files = list(base_path.rglob("*.html"))
        html_files.extend(list(base_path.rglob("*.htm")))

        print(f"Found {len(html_files)} HTML files to process")

        all_activities = []
        processed = 0
        errors = 0

        for i, html_file in enumerate(html_files):
            try:
                activities = self.extract_from_html(str(html_file))
                all_activities.extend(activities)
                processed += 1

                # Progress update every 100 files
                if (i + 1) % 100 == 0:
                    print(f"Progress: {i+1}/{len(html_files)} files processed, {len(all_activities)} activities found")
                    # Save the batch to the DB
                    if all_activities:
                        saved, dupes = self.save_to_database(all_activities)
                        print(f"Batch saved: {saved} new activities, {dupes} duplicates skipped")
                        all_activities = []  # Clear the buffer

            except Exception as e:
                print(f"Error processing {html_file}: {e}")
                errors += 1

        # Save any remaining activities
        if all_activities:
            saved, dupes = self.save_to_database(all_activities)
            print(f"Final batch saved: {saved} new activities, {dupes} duplicates skipped")

        print("\nProcessing complete!")
        print(f"Files processed: {processed}")
        print(f"Errors: {errors}")

        return processed, errors

# Entry point for a quick manual test
if __name__ == "__main__":
    extractor = HTMLActivityExtractor()

    # Test on a few sample files first
    print("Testing on sample files first...")
    # Pick up to three HTML files for the test run
    test_files = list(Path('/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri').rglob("*.html"))[:3]

    for test_file in test_files:
        print(f"\nTesting: {test_file}")
        activities = extractor.extract_from_html(str(test_file))
        print(f"Found {len(activities)} activities")
        if activities:
            print(f"Sample activity: {activities[0]['name'][:50]}...")

    # Ask whether to continue with the full run
    response = input("\nContinue with full processing? (y/n): ")
    if response.lower() == 'y':
        extractor.process_all_html_files()
```

### Step 2.2: Text/MD Extractor (simple - 45 files)

**Claude Code should create `/scripts/extractors/text_extractor.py`:**

```python
#!/usr/bin/env python3
"""
Text/Markdown Activity Extractor.
Processes TXT and MD files and extracts activities.
"""

import re
from pathlib import Path
from typing import List, Dict
import sqlite3
from datetime import datetime


class TextActivityExtractor:
    def __init__(self, db_path='data/activities.db'):
        self.db_path = db_path
        self.activity_patterns = {
            'section_headers': [
                r'^#{1,6}\s*(.+)$',         # Markdown headers
                r'^([A-Z][^\.]{10,100})$',  # Plain titles
                r'^\d+\.\s*(.+)$',          # Numbered lists
                r'^[•\-\*]\s*(.+)$',        # Bullet points
            ],
            'activity_markers': [
                'joc:', 'activitate:', 'exercitiu:', 'team building:',
                'nume:', 'titlu:', 'denumire:'
            ]
        }

    def extract_from_text(self, file_path: str) -> List[Dict]:
        """Extract activities from a text/markdown file."""
        activities = []

        try:
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                content = f.read()

            # Method 1: markdown sections
            if file_path.endswith('.md'):
                activities.extend(self._extract_from_markdown(content, file_path))

            # Method 2: generic patterns
            activities.extend(self._extract_from_patterns(content, file_path))

            # Method 3: structured text blocks
            activities.extend(self._extract_from_blocks(content, file_path))

        except Exception as e:
            print(f"Error processing {file_path}: {e}")

        return activities

    def _extract_from_markdown(self, content, source_file):
        """Extract activities from markdown content."""
        activities = []
        lines = content.split('\n')

        current_activity = None
        current_content = []

        for line in lines:
            # Is this a header line?
            if re.match(r'^#{1,3}\s*(.+)', line):
                # Save the previous activity, if any
                if current_activity and current_content:
                    current_activity['description'] = '\n'.join(current_content[:20])  # Max 20 lines
                    activities.append(current_activity)

                # Does the new header look like an activity?
                header_text = re.sub(r'^#{1,3}\s*', '', line)
                if any(marker in header_text.lower() for marker in ['joc', 'activitate', 'exercitiu']):
                    current_activity = {
                        'name': header_text[:200],
                        'source_file': str(source_file),
                        'category': '[A]'
                    }
                    current_content = []
                else:
                    current_activity = None

            elif current_activity:
                # Append content to the current activity
                if line.strip():
                    current_content.append(line)

        # Save the last activity
        if current_activity and current_content:
            current_activity['description'] = '\n'.join(current_content[:20])
            activities.append(current_activity)

        return activities

    def _extract_from_patterns(self, content, source_file):
        """Extract using marker-based pattern matching."""
        activities = []

        # Look for explicit activity markers
        for marker in self.activity_patterns['activity_markers']:
            pattern = re.compile(f'{re.escape(marker)}\\s*(.+?)(?=\\n\\n|{re.escape(marker)}|$)',
                                 re.IGNORECASE | re.DOTALL)
            matches = pattern.finditer(content)

            for match in matches:
                activity_text = match.group(1)
                if len(activity_text) > 20:
                    activity = {
                        'name': activity_text.split('\n')[0][:200],
                        'description': activity_text[:1000],
                        'source_file': str(source_file),
                        'category': '[A]'
                    }
                    activities.append(activity)

        return activities

    def _extract_from_blocks(self, content, source_file):
        """Extract from blank-line-separated text blocks."""
        activities = []

        # Split into blocks separated by blank lines
        blocks = re.split(r'\n\s*\n', content)

        for block in blocks:
            if len(block) > 50:  # At least 50 characters
                lines = block.strip().split('\n')
                first_line = lines[0].strip()

                # Does the block look like an activity?
                if any(keyword in first_line.lower() for keyword in ['joc', 'activitate', 'exercitiu']):
                    activity = {
                        'name': first_line[:200],
                        'description': block[:1000],
                        'source_file': str(source_file),
                        'category': '[A]'
                    }
                    activities.append(activity)

        return activities

    def save_to_database(self, activities):
        """Save activities to the database."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        saved_count = 0

        for activity in activities:
            try:
                # Check for duplicates
                cursor.execute(
                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
                    (activity.get('name'), activity.get('source_file'))
                )

                if not cursor.fetchone():
                    columns = list(activity.keys())
                    values = list(activity.values())
                    placeholders = ['?' for _ in values]

                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
                    cursor.execute(query, values)
                    saved_count += 1

            except Exception as e:
                print(f"Error saving: {e}")

        conn.commit()
        conn.close()

        return saved_count

    def process_all_text_files(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
        """Process every text and markdown file."""
        base_path = Path(base_path)

        text_files = list(base_path.rglob("*.txt"))
        md_files = list(base_path.rglob("*.md"))
        all_files = text_files + md_files

        print(f"Found {len(all_files)} text/markdown files")

        all_activities = []

        for file_path in all_files:
            activities = self.extract_from_text(str(file_path))
            all_activities.extend(activities)
            print(f"Processed {file_path.name}: {len(activities)} activities")

        # Save to database
        saved = self.save_to_database(all_activities)
        print(f"\nTotal saved: {saved} activities from {len(all_files)} files")

        return len(all_files), saved


if __name__ == "__main__":
    extractor = TextActivityExtractor()
    extractor.process_all_text_files()
```

### Step 2.3: Unified Processor (orchestrator)

**Claude Code should create `/scripts/extractors/unified_processor.py`:**

```python
#!/usr/bin/env python3
"""
Unified Activity Processor.
Orchestrates all extractors for a full processing run.
"""

import time
from pathlib import Path
from html_extractor import HTMLActivityExtractor
from text_extractor import TextActivityExtractor
import sqlite3


class UnifiedProcessor:
    def __init__(self, db_path='data/activities.db'):
        self.db_path = db_path
        self.html_extractor = HTMLActivityExtractor(db_path)
        self.text_extractor = TextActivityExtractor(db_path)
        self.stats = {
            'html_processed': 0,
            'text_processed': 0,
            'pdf_to_process': 0,
            'doc_to_process': 0,
            'total_activities': 0,
            'start_time': None,
            'end_time': None
        }

    def get_current_activity_count(self):
        """Get the current number of activities in the DB."""
        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()
        cursor.execute("SELECT COUNT(*) FROM activities")
        count = cursor.fetchone()[0]
        conn.close()
        return count

    def count_files_to_process(self, base_path):
        """Count the files that need processing."""
        base_path = Path(base_path)

        counts = {
            'html': len(list(base_path.rglob("*.html"))) + len(list(base_path.rglob("*.htm"))),
            'txt': len(list(base_path.rglob("*.txt"))),
            'md': len(list(base_path.rglob("*.md"))),
            'pdf': len(list(base_path.rglob("*.pdf"))),
            'doc': len(list(base_path.rglob("*.doc"))),
            'docx': len(list(base_path.rglob("*.docx")))
        }

        return counts

    def process_automated_formats(self, base_path='/mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri'):
        """Process every format that can be automated."""
        print("="*60)
        print("UNIFIED ACTIVITY PROCESSOR - AUTOMATED PHASE")
        print("="*60)

        self.stats['start_time'] = time.time()
        initial_count = self.get_current_activity_count()

        # Show the initial statistics
        file_counts = self.count_files_to_process(base_path)
        print("\nFiles to process:")
        for fmt, count in file_counts.items():
            print(f"  {fmt.upper()}: {count} files")
        print(f"\nCurrent activities in database: {initial_count}")
        print("-"*60)

        # PHASE 1: HTML processing (top priority - highest volume)
        print("\n[1/2] Processing HTML files...")
        print("-"*40)
        html_processed, html_errors = self.html_extractor.process_all_html_files(base_path)
        self.stats['html_processed'] = html_processed

        # PHASE 2: Text/MD processing
        print("\n[2/2] Processing Text/Markdown files...")
        print("-"*40)
        text_processed, text_saved = self.text_extractor.process_all_text_files(base_path)
        self.stats['text_processed'] = text_processed

        # Final statistics
        self.stats['end_time'] = time.time()
        final_count = self.get_current_activity_count()
        self.stats['total_activities'] = final_count - initial_count

        # Identify the files that need manual processing
        self.stats['pdf_to_process'] = file_counts['pdf']
        self.stats['doc_to_process'] = file_counts['doc'] + file_counts['docx']

        self.print_summary()
        self.save_pdf_doc_list(base_path)

    def print_summary(self):
        """Print the processing summary."""
        print("\n" + "="*60)
        print("PROCESSING SUMMARY")
        print("="*60)

        duration = self.stats['end_time'] - self.stats['start_time']

        print("\nAutomated Processing Results:")
        print(f"  HTML files processed: {self.stats['html_processed']}")
        print(f"  Text/MD files processed: {self.stats['text_processed']}")
        print(f"  New activities added: {self.stats['total_activities']}")
        print(f"  Processing time: {duration:.1f} seconds")

        print("\nFiles requiring Claude processing:")
        print(f"  PDF files: {self.stats['pdf_to_process']}")
        print(f"  DOC/DOCX files: {self.stats['doc_to_process']}")

        print("\n" + "="*60)
        print("NEXT STEPS:")
        print("1. Review the file 'pdf_doc_for_claude.txt' for manual processing")
        print("2. Use Claude to extract activities from PDF/DOC files")
        print("3. Focus on largest PDF files first (highest activity density)")
        print("="*60)

    def save_pdf_doc_list(self, base_path):
        """Save the list of PDF/DOC files for Claude processing."""
        base_path = Path(base_path)

        pdf_files = sorted(base_path.rglob("*.pdf"), key=lambda p: p.stat().st_size, reverse=True)
        doc_files = list(base_path.rglob("*.doc"))
        docx_files = list(base_path.rglob("*.docx"))

        with open('pdf_doc_for_claude.txt', 'w', encoding='utf-8') as f:
            f.write("PDF/DOC FILES FOR CLAUDE PROCESSING\n")
            f.write("="*60 + "\n")
            f.write("Files sorted by size (largest first = likely more activities)\n\n")

            f.write("TOP PRIORITY PDF FILES (process these first):\n")
            f.write("-"*40 + "\n")
            for i, pdf in enumerate(pdf_files[:20], 1):
                size_mb = pdf.stat().st_size / (1024*1024)
                f.write(f"{i}. {pdf.name} ({size_mb:.1f} MB)\n")
                f.write(f"   Path: {pdf}\n\n")

            if len(pdf_files) > 20:
                f.write(f"\n... and {len(pdf_files)-20} more PDF files\n\n")

            f.write("\nDOC/DOCX FILES:\n")
            f.write("-"*40 + "\n")
            for doc in doc_files + docx_files:
                size_kb = doc.stat().st_size / 1024
                f.write(f"- {doc.name} ({size_kb:.1f} KB)\n")

        print("\nPDF/DOC list saved to: pdf_doc_for_claude.txt")


if __name__ == "__main__":
    processor = UnifiedProcessor()
    processor.process_automated_formats()
```

---

## PHASE 3: MANUAL PROCESSING WITH CLAUDE (3-4 hours)

### Step 3.1: Template for extraction with Claude

**Claude Code should create `/scripts/claude_extraction_template.md`:**

```markdown
# TEMPLATE FOR ACTIVITY EXTRACTION WITH CLAUDE

## Instructions for Claude Code:

For each PDF/DOC, use the following extraction format:

### 1. Read the file:
```
Claude, please read the file: [FILE_PATH]
```

### 2. Extract the activities using this JSON template:
```json
{
  "source_file": "[FILE_NAME]",
  "activities": [
    {
      "name": "Activity name",
      "description": "Full description of the activity",
      "rules": "Rules of the game/activity",
      "variations": "Variants or adaptations",
      "category": "[A-H] based on type",
      "age_group_min": 6,
      "age_group_max": 14,
      "participants_min": 4,
      "participants_max": 20,
      "duration_min": 10,
      "duration_max": 30,
      "materials_list": "List of required materials",
      "skills_developed": "Skills developed",
      "difficulty_level": "Ușor/Mediu/Dificil",
      "keywords": "comma-separated keywords",
      "tags": "relevant tags"
    }
  ]
}
```

### 3. Save to file:
After extraction, save the JSON to: `/scripts/extracted_activities/[FILE_NAME].json`

### 4. Processing priorities:

**TOP PRIORITY (process these first):**
1. 1000 Fantastic Scout Games.pdf
2. Cartea Mare a jocurilor.pdf
3. 160-de-activitati-dinamice-jocuri-pentru-team-building.pdf
4. 101 Ways to Create an Unforgettable Camp Experience.pdf
5. 151 Awesome Summer Camp Nature Activities.pdf

**Focus categories:**
- [A] Scout Games
- [C] Camping & Outdoor Activities
- [G] Educational Activities
```
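
Before importing, it is worth sanity-checking Claude's JSON output against the template. Below is a minimal validation sketch; the required-field list is an assumption based on the template above, and `validate_extraction` is a hypothetical helper, not part of the import script.

```python
# Sketch: sanity-check a Claude-produced JSON file before import.
# REQUIRED is an assumed minimal field set based on the template above.
import json
from pathlib import Path
from typing import List

REQUIRED = ("name", "description", "category")

def validate_extraction(json_path: str) -> List[str]:
    """Return a list of problems found; an empty list means importable."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if "activities" not in data:
        return ["missing top-level 'activities' key"]
    problems = []
    for i, act in enumerate(data["activities"]):
        for field in REQUIRED:
            if not act.get(field):
                problems.append(f"activity {i}: missing '{field}'")
    return problems

if __name__ == "__main__":
    for path in Path("scripts/extracted_activities").glob("*.json"):
        issues = validate_extraction(str(path))
        print(f"{path.name}: {'OK' if not issues else '; '.join(issues)}")
```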

### Step 3.2: Script to import activities from JSON

**Claude Code should create `/scripts/import_claude_activities.py`:**

```python
#!/usr/bin/env python3
"""
Import activities extracted by Claude from JSON files.
"""

import json
import sqlite3
from pathlib import Path
from datetime import datetime


class ClaudeActivityImporter:
    def __init__(self, db_path='data/activities.db'):
        self.db_path = db_path
        self.json_dir = Path('scripts/extracted_activities')
        self.json_dir.mkdir(parents=True, exist_ok=True)

    def import_json_file(self, json_path):
        """Import activities from a single JSON file."""
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        source_file = data.get('source_file', str(json_path))
        activities = data.get('activities', [])

        conn = sqlite3.connect(self.db_path)
        cursor = conn.cursor()

        imported = 0
        for activity in activities:
            try:
                # Add the source file and a timestamp
                activity['source_file'] = source_file
                activity['created_at'] = datetime.now().isoformat()

                # Prepare the insert
                columns = list(activity.keys())
                values = list(activity.values())
                placeholders = ['?' for _ in values]

                # Check for a duplicate
                cursor.execute(
                    "SELECT id FROM activities WHERE name = ? AND source_file = ?",
                    (activity.get('name'), source_file)
                )

                if not cursor.fetchone():
                    query = f"INSERT INTO activities ({', '.join(columns)}) VALUES ({', '.join(placeholders)})"
                    cursor.execute(query, values)
                    imported += 1

            except Exception as e:
                print(f"Error importing activity: {e}")

        conn.commit()
        conn.close()

        print(f"Imported {imported} activities from {json_path.name}")
        return imported

    def import_all_json_files(self):
        """Import all JSON files from the extracted_activities directory."""
        json_files = list(self.json_dir.glob("*.json"))

        if not json_files:
            print("No JSON files found in extracted_activities directory")
            return 0

        total_imported = 0
        for json_file in json_files:
            imported = self.import_json_file(json_file)
            total_imported += imported

        print(f"\nTotal imported: {total_imported} activities from {len(json_files)} files")
        return total_imported


if __name__ == "__main__":
    importer = ClaudeActivityImporter()
    importer.import_all_json_files()
```

---

## PHASE 4: MAIN ORCHESTRATION SCRIPT

**Claude Code should create `/scripts/run_extraction.py`:**

```python
#!/usr/bin/env python3
"""
Main extraction orchestrator.
Runs the entire extraction process.
"""

import sys
from pathlib import Path

# Put the extractors directory on the path so the flat imports inside the
# package (e.g. "from html_extractor import ...") resolve
sys.path.append(str(Path(__file__).parent / 'extractors'))

from extractors.unified_processor import UnifiedProcessor
from import_claude_activities import ClaudeActivityImporter


def main():
    print("="*60)
    print("ACTIVITY EXTRACTION SYSTEM")
    print("Strategy S8: Hybrid Claude + Scripts")
    print("="*60)

    # Step 1: run the automated extraction
    print("\nSTEP 1: Automated Extraction")
    print("-"*40)
    processor = UnifiedProcessor()
    processor.process_automated_formats()

    # Step 2: wait for the manual Claude processing
    print("\n" + "="*60)
    print("STEP 2: Manual Claude Processing Required")
    print("-"*40)
    print("Please process PDF/DOC files with Claude using the template.")
    print("Files are listed in: pdf_doc_for_claude.txt")
    print("Save extracted activities as JSON in: scripts/extracted_activities/")
    print("="*60)

    response = input("\nHave you completed Claude processing? (y/n): ")

    if response.lower() == 'y':
        # Step 3: import the Claude-extracted activities
        print("\nSTEP 3: Importing Claude-extracted activities")
        print("-"*40)
        importer = ClaudeActivityImporter()
        importer.import_all_json_files()

        print("\n" + "="*60)
        print("EXTRACTION COMPLETE!")
        print("="*60)


if __name__ == "__main__":
    main()
```

---

## EXECUTION INSTRUCTIONS FOR CLAUDE CODE

### Automated execution (Claude Code should run these):

```bash
# 1. Initial setup
cd /mnt/d/GoogleDrive/Cercetasi/carti-camp-jocuri/INDEX-SISTEM-JOCURI
pip install beautifulsoup4 lxml pypdf2 python-docx chardet

# 2. Create all the files described above
# [Claude Code creates all the Python files described]

# 3. Back up the database
cp data/activities.db data/activities_backup_$(date +%Y%m%d).db

# 4. Run the automated extraction
python scripts/run_extraction.py

# 5. After the automated run, show the PDF list for manual processing
head -30 pdf_doc_for_claude.txt
```

### Processing PDF/DOC files with Claude:

1. **Claude reads each PDF from the priority list**
2. **Extracts the activities in JSON format**
3. **Saves them to `/scripts/extracted_activities/[file_name].json`**
4. **When done, run the import**:
```bash
python scripts/import_claude_activities.py
```

### Final verification:

```bash
# Check how many activities were indexed
sqlite3 data/activities.db "SELECT COUNT(*) as total FROM activities;"

# Check the distribution across categories
sqlite3 data/activities.db "SELECT category, COUNT(*) as count FROM activities GROUP BY category;"

# Check the top sources
sqlite3 data/activities.db "SELECT source_file, COUNT(*) as count FROM activities GROUP BY source_file ORDER BY count DESC LIMIT 10;"
```
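
For the quality spot-checks recommended in the implementation notes below, here is a small sketch that samples random rows for manual review (it only touches columns listed in the schema section):

```python
# Sketch: pull a few random activities for a manual quality spot-check.
import sqlite3

conn = sqlite3.connect('data/activities.db')
rows = conn.execute(
    "SELECT category, name, source_file FROM activities "
    "ORDER BY RANDOM() LIMIT 5"
).fetchall()
conn.close()

for category, name, source in rows:
    print(f"{category} {name[:60]}  <- {source}")
```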

---

## ESTIMATED RESULTS

### After automated processing (4 hours):
- **HTML**: ~1200-1500 activities
- **TXT/MD**: ~100-200 activities
- **Automated total**: ~1300-1700 activities

### After Claude processing (3-4 hours):
- **PDF**: ~300-500 activities (high quality)
- **DOC/DOCX**: ~100-150 activities
- **Claude total**: ~400-650 activities

### FINAL TOTAL: ~1700-2350 activities

---

## TROUBLESHOOTING

### Common problems and solutions:

1. **Encoding errors**: the scripts use chardet for auto-detection
2. **Memory issues**: files are processed in batches of 100
3. **Duplicate detection**: automatic check on name + source_file
4. **PDF extraction fails**: fall back to manual processing with Claude
5. **Database locked**: close the Flask app before processing (a connection-level mitigation is sketched below)
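
If closing Flask is not enough, a busy timeout plus WAL journaling usually reduces lock errors. This is a general SQLite mitigation, not something the extractors above already configure:

```python
# Sketch: connection-level mitigation for "database is locked".
import sqlite3

conn = sqlite3.connect('data/activities.db', timeout=30)  # wait up to 30s on locks
conn.execute("PRAGMA journal_mode=WAL;")  # readers no longer block the writer
conn.close()
```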

---

## IMPLEMENTATION NOTES

1. **Prioritize the large PDFs** - they contain the most activities
2. **Run it overnight** if you want the whole automated batch processed in one go
3. **Progress is saved as you go** - the scripts write in batches
4. **Check the quality** - spot-check a few random activities
5. **Always back up** - the script includes an automatic backup

This plan automates 90% of the work. Claude Code can run everything automatically; you only supervise and process the high-value PDFs with Claude.