# Claude Code + Karpathy's Autoresearch = INSANE RESULTS!

**URL:** https://youtu.be/0PO6m09_80Q
**Duration:** 12:44
**Saved:** 2026-03-21
**Tags:** @work @scout #autoresearch #self-improving #automation #machine-learning

---

## 📋 TL;DR

The author builds a self-improving system for YouTube thumbnails, inspired by Andrej Karpathy's autoresearch loop. The system pulls real data (500+ videos, CTR from the YouTube API), creates binary eval criteria (12 yes/no questions about thumbnail quality), iterates fast (10 cycles × 3 thumbnails), rewrites its own prompts automatically, then runs daily with 4 feedback sources: the YouTube Reporting API (real post-publish CTR), ABC split tests (the highest-confidence signal), human feedback during iterations, and fast iterations (offline scoring). Result: the eval score grew from 8.7/12 to 11/12 in 10 iterations with no human intervention. Performance gap: old thumbnails ~14% CTR vs. new ones ~3.4% CTR → the system learns from what worked before.

---

## 🎯 Key Points

### 1. Data-Driven Eval Criteria (Not Vibes)

**Process:**
- Scraped 180+ videos from the last 3 years
- Grouped into 3 categories: winners (high CTR), losers (low CTR), mid
- Statistical analysis of titles and thumbnails

**Data-backed patterns:**
- **"How to"** in the title: 50% of winners vs. 23% of losers
- **"Tutorial"**: 44% of winners vs. 13% of losers
- **Negative framing** (stop, forget, RIP): only 6% of winners
- **Exclamation marks**: loser criterion
- **Questions in the title**: loser criterion

**Conclusion:** Criteria grounded in real CTR, not in "it looks good to me."

---

### 2. 12 Binary Eval Questions

Format: **Yes/No** (not a 1-10 scale) - eliminates ambiguity.

**Visual Anchor & Attention:**
1. Single dominant visual anchor (face/graphic) taking 20%+ of the frame?
2. Anchor conveys emotion/energy/intrigue?
3. Directional cues present (arrows, pointing)?

**Text & Readability:**
4. Text limited to 1-4 bold, high-contrast words?
5. Text readable at mobile size?
**Composition:**
6. Background simple and uncluttered?
7. Clear visual hierarchy?
8. Shows result/output/transformation (not just tool/process)?

**Branding:**
9. One or more recognizable logos present?

**Packaging (for the title):**
10-12. Similar criteria for the title (how-to, tutorial, avoid negative framing)

**Why binary:** Consistent scoring, automatable, reproducible.

---

### 3. Fast Iteration Loop (Offline)

**Flow:**
1. Generate 3 thumbnails
2. Score each against the 12 criteria (Gemini Vision)
3. Identify failures (criteria = no)
4. Rewrite the generation prompt to fix the failures
5. Repeat

**Results (10 iterations):**
- Start: 8.7/12 average score
- End: 11/12 single best thumbnail
- **No human feedback**

**Examples of prompt improvements:**
- Iteration 1: "Add emotional intrigue"
- Iteration 3: "Make text much bigger and bolder"
- Iteration 5: "Simplify background, remove clutter"
- Iteration 8: "Increase visual hierarchy with directional cues"

**Benefit:** A better baseline BEFORE publishing.

---

### 4. Daily Slow Loop (Online Feedback)

**Full flow:**
1. **Create thumbnail:** using the thumbnail skill + feedback-memory rules
2. **Publish video**
3. **Wait 2-3 days:** YouTube Reporting API data becomes available
4. **Pull CTR data:** real click-through rate
5. **Score thumbnail:** against the 12 criteria
6. **Correlate:** high eval score + low CTR? = false positive
7. **Update the feedback-memory JSON:** new data-backed rules
8. **The next thumbnail starts from a better baseline**

**Example correlation:**
- Thumbnail scored 11/12 but got 3.4% CTR → false positive
- Identify which criteria failed in practice
- Update rules: "Circular logos = avoid" or "Too much background detail = reduce"

---

### 5. Four Feedback Sources

**1. YouTube Reporting API (slow but accurate)**
- Real post-publish CTR
- 2-3 days of latency
- Objective performance data

**2. ABC Split Tests (highest confidence)**
- Same video, same audience, different packaging
- YouTube picks the winner automatically
- Controlled experiment = most reliable signal
- Extract winner/loser criteria → feed into the memory JSON

**3. Human Feedback (during creation)**
- The author gives feedback on iterations: "I like this, don't like that"
- Subjective but fast
- Helps refine taste preferences

**4. Fast Iterations (offline scoring)**
- Eval before publishing
- Catches obvious failures
- Improves the baseline

**Prioritization:** ABC splits > YouTube API > Fast iterations > Human feedback

---

### 6. Self-Rewriting Prompts

**Mechanism:**
- Centralized `feedback_memory.json`
- Contains data-backed rules (not vibes)
- Auto-injected into generation prompts

**Example feedback memory:**
```json
{
  "rules": [
    {"rule": "Use 'How to' in title", "confidence": 0.85, "source": "API"},
    {"rule": "Avoid circular logos", "confidence": 0.72, "source": "split_test"},
    {"rule": "Text size minimum 48px", "confidence": 0.91, "source": "iterations"}
  ],
  "winners": [...],
  "losers": [...]
}
```

**Every new thumbnail:**
- Loads the feedback memory
- Starts from a better baseline
- Incorporates all previous learnings

**Result:** Compounding improvements over time

---

## 💬 Relevant Quotes

> "It's never been clearer to me that we need to create these automated loops that improve itself every single time we do them."

> "You can't make up the eval criteria based on vibes. It has to be a yes/no answer."

> "The split test signal is the highest confidence signal because it is a controlled experiment. Same video, same audience but different packaging."

> "Every new thumbnail starts from a better baseline than the last."

> "The numbers are clear. The winners were using 'how to' in the titles 50% of the time, losers 23%."

> "It added specific features like make the text much bigger and bolder. It fixed the text again.
It went from giving an average of 8.7 to a single 11 out of 12 in 10 iterations without giving me a single feedback."

> "That video got 29,000 views. But something interesting happened when I was checking the backend stats... the impression click-through rate of this video was 8%. But I have been making videos for 3 years in the AI space and some of my older videos are hitting 14%."

---

## 💡 Insights & Ideas

### ✅ Universal Pattern - Applicable to Echo/Marius

#### 1. Autoresearch Loop = Binary Eval Criteria + Fast Iterations + Feedback Memory

**Core concept:**
- A system that rewrites its own prompts based on real data
- Not thumbnail-specific - a universal pattern

**Components:**
1. **Binary eval criteria** (yes/no, not scales)
2. **Fast iterations** (offline, before deploy)
3. **Slow feedback** (online, post-deploy)
4. **Feedback memory** (centralized rules, auto-injected)

**Applicability to Echo:**

**A. Morning/Evening Reports**
- **Eval criteria:** Includes DONE items? Calendar <48h? Insights with quotes? Under 500 words?
- **Fast iterations:** Generate 3 variants → score → improve → repeat × 5
- **Slow feedback:** Track email open time, reply engagement, ignored sections
- **Memory:** `memory/feedback/report-rules.json`

**B. YouTube Processing**
- **Eval criteria:** TL;DR <150 words? 5+ key points? 3+ quotes? Domain tags?
- **Fast iterations:** Process transcript → 3 summary variants → score → improve
- **Slow feedback:** Which insights get [x] executed vs. [ ] ignored? Which domains get engagement?
- **Memory:** `memory/feedback/youtube-rules.json`

**C. Coaching Messages (08:00 & 23:00)**
- **Eval criteria:** Open question? Under 100 words? Empathetic tone? Tied to the avatar?
- **Fast iterations:** 3 message variants → score tone/relevance → improve
- **Slow feedback:** Reply rate? Depth of Marius's response? Engagement patterns?
- **Memory:** `memory/feedback/coaching-rules.json`

**D. Calendar Alerts**
- **Eval criteria:** Alert <2h before? Includes location? Includes context? Clear action?
- **Fast iterations:** N/A (simple alert)
- **Slow feedback:** Snooze vs. confirm rate? Which events get a fast reply?
- **Memory:** `memory/feedback/calendar-rules.json`

---

#### 2. Binary Eval Criteria >> Subjective Scoring

**Why yes/no beats a 1-10 scale:**
- **Eliminates ambiguity:** "Are there 3+ quotes?" = clear; "Insight quality 1-10?" = subjective
- **Easy to automate:** regex, simple checks, no ML needed
- **Reproducible:** same input → same score (not mood-dependent)
- **Actionable:** "No" = you know exactly what to fix; "Score 6/10" = what does that mean?

**For Echo:**
- ✅ "Includes link preview?" vs. ❌ "How useful is the link, 1-10?"
- ✅ "Marius replied <24h?" vs. ❌ "How urgent did it seem, 1-10?"
- ✅ "Git uncommitted files?" vs. ❌ "How important is the commit, 1-10?"

**Simple implementation:**
```python
def eval_binary_criteria(content, criteria_list):
    score = 0
    failures = []
    for criterion in criteria_list:
        if criterion['check'](content):
            score += 1
        else:
            failures.append(criterion['name'])
    return {'score': score, 'total': len(criteria_list), 'failures': failures}
```

---

#### 3. Fast Iterations (Offline) vs Slow Feedback (Online)

**Fast iterations (before deploy):**
- **Goal:** improve the baseline without waiting for real-world data
- **Speed:** seconds to minutes
- **Feedback:** eval criteria (binary checks)
- **Benefit:** start from a better baseline

**Slow feedback (post-deploy):**
- **Goal:** validate assumptions, correlate eval scores with real outcomes
- **Speed:** hours to days
- **Feedback:** real user behavior (CTR, reply rate, engagement)
- **Benefit:** detect false positives, refine rules

**For the Ralph workflow:**
- **Fast:** PRD generation → self-review stories → Opus rewrites stories → iterate (before Claude Code implementation)
- **Slow:** deploy → track bugs, missed dependencies, story rewrites → feed back into PRD templates

**Combined benefit:**
- Fast = fewer bad deploys
- Slow = continuous refinement based on reality

---

#### 4. Multiple Feedback Sources = Higher Confidence

**YouTube case (4 sources):**
1. YouTube API (real CTR) - objective, slow
2. ABC split tests - highest confidence (controlled experiment)
3. Human feedback - subjective, fast
4. Fast iterations - eval-based, instant

**Prioritization:** controlled experiments > objective metrics > eval criteria > human vibes

**For Echo:**

**Morning Reports:**
1. **Email open tracking** (objective, medium speed) - "Open rate <1h?"
2. **Reply engagement** (objective, fast) - "Reply to which sections?"
3. **A/B test formats** (highest confidence) - "Weekly variation, track response"
4. **Self-eval** (instant) - "Binary criteria passed?"

**YouTube Processing:**
1. **Insights execution rate** (objective, slow) - "[x] vs. [ ] ratio"
2. **Follow-up tasks** (objective, medium) - "Does the video generate a task?"
3. **Domain relevance** (subjective, fast) - "Marius interest level?"
4. **Self-eval** (instant) - "TL;DR length, quote count, tags present?"

**Implementation:**
```python
feedback_sources = [
    {'name': 'objective_metric', 'weight': 0.4},  # CTR, reply rate, etc.
    {'name': 'controlled_test',  'weight': 0.3},  # A/B splits
    {'name': 'eval_criteria',    'weight': 0.2},  # binary checks
    {'name': 'human_feedback',   'weight': 0.1}   # subjective
]

def aggregate_feedback(sources_data):
    # sources_data is aligned with feedback_sources: one score dict per source
    weighted_score = sum(data['score'] * src['weight']
                         for src, data in zip(feedback_sources, sources_data))
    return weighted_score
```

---

#### 5. Self-Rewriting Prompts via Feedback JSON

**Pattern:**
- Centralized feedback memory (`feedback_memory.json`)
- Contains data-backed rules (confidence score, source)
- Auto-injected into generation prompts
- Every iteration starts from a better baseline

**Example structure:**
```json
{
  "domain": "morning_reports",
  "last_updated": "2026-03-21",
  "rules": [
    {
      "rule": "Include DONE items in the first 3 paragraphs",
      "confidence": 0.89,
      "source": "email_tracking",
      "rationale": "Open rate +42% when DONE is at the top"
    },
    {
      "rule": "Calendar alerts <48h must be bold",
      "confidence": 0.76,
      "source": "reply_engagement",
      "rationale": "Confirm rate +28% when bold"
    },
    {
      "rule": "Skip the git status section when there are no uncommitted files",
      "confidence": 0.94,
      "source": "controlled_test",
      "rationale": "Reply time -15min when empty sections are skipped"
    }
  ],
  "anti_patterns": [
    {
      "pattern": "Bullet lists >10 items",
      "confidence": 0.81,
      "rationale": "Ignored rate +35%"
    }
  ]
}
```

**Auto-injection into the prompt:**
```python
import json

def enhance_prompt_with_feedback(base_prompt, feedback_json_path):
    with open(feedback_json_path) as f:
        feedback = json.load(f)

    # Keep only high-confidence rules (>0.7)
    rules = [r for r in feedback['rules'] if r['confidence'] > 0.7]
    rules_text = "\n".join(f"- {r['rule']} (confidence: {r['confidence']:.0%})"
                           for r in rules)
    anti_text = "\n".join(f"- {ap['pattern']}"
                          for ap in feedback['anti_patterns'])

    # Inject into the prompt
    return f"""{base_prompt}

DATA-BACKED RULES (apply these strictly):
{rules_text}

ANTI-PATTERNS (avoid these):
{anti_text}
"""
```

**Benefit:** compounding improvements - every report/insight/email is better than the last.

---

#### 6. Data >> Vibes

**YouTube case:**
- Gap: 14% CTR (old thumbnails) vs. 3.4% CTR (new) = **10+ percentage points**
- Objective, measurable, impossible to ignore

**For Marius:**

**A. New clients (entrepreneurship)**
- **Vibe:** "I don't know if it will work"
- **Data:** track pitch proposals → response rate → conversion rate
- **Insight:** "Email pitch with a case study = 43% reply vs. 12% without"

**B. ROA support tickets**
- **Vibe:** "This client is difficult"
- **Data:** track ticket resolution time, follow-up questions, satisfaction
- **Insight:** "Video tutorial = 2.1 follow-ups vs. 4.7 with a text explanation"

**C. ROA features**
- **Vibe:** "Feature X is important"
- **Data:** track feature usage post-deploy (analytics)
- **Insight:** "New reports = 78% monthly active users, PDF export = 12%"

**D. Echo reports**
- **Vibe:** "This report is useful"
- **Data:** track open rate, reply time, sections clicked
- **Insight:** "Morning report opened <1h = 64%, evening report = 31%"

**Tracking implementation:**
```python
# in tools/analytics_tracker.py
import json
import sqlite3
import time

class FeedbackTracker:
    def __init__(self, db_path='memory/feedback/analytics.db'):
        self.db = sqlite3.connect(db_path)

    def track_event(self, domain, event_type, metadata):
        """Track any feedback event."""
        self.db.execute("""
            INSERT INTO events (domain, type, metadata, timestamp)
            VALUES (?, ?, ?, ?)
        """, (domain, event_type, json.dumps(metadata), time.time()))
        self.db.commit()

    def get_insights(self, domain, window_days=30):
        """Extract data-backed insights."""
        # Query events in the window
        # Calculate rates, patterns, correlations
        # Return ranked insights with confidence scores
```

---

### 🛠️ Practical Implementation for Echo

#### Plan A: Self-Improving Morning Reports

**Phase 1: Set up eval criteria (1 day)**
```python
# in tools/morning_report_autoresearch.py
import re

EVAL_CRITERIA = [
    {
        'name': 'done_items_present',
        'check': lambda report: bool(re.search(r'✅.*DONE', report)),
        'weight': 0.15
    },
    {
        'name': 'calendar_alerts_48h',
        'check': lambda report: bool(re.search(r'📅.*<48h', report)),
        'weight': 0.20
    },
    {
        'name': 'length_under_500',
        'check': lambda report: len(report.split()) < 500,
        'weight': 0.10
    },
    {
        'name': 'insights_with_quotes',
        'check': lambda report: report.count('"') >= 2,
        'weight': 0.15
    },
    {
        'name': 'git_status_if_needed',
        'check': lambda report: ('uncommitted' in report.lower())
                                or ('git status: clean' in report.lower()),
        'weight': 0.10
    },
    {
        'name': 'link_preview_offered',
        'check': lambda report: 'moltbot.tailf7372d.ts.net/echo/' in report,
        'weight': 0.10
    }
]
```

**Phase 2: Fast iterations (integrated into daily-morning-checks)**
```python
def generate_report_with_autoresearch():
    # Load the feedback memory
    feedback = load_feedback('memory/feedback/morning-report-rules.json')

    # Enhance the base prompt
    prompt = enhance_prompt_with_feedback(BASE_REPORT_PROMPT, feedback)

    # Fast iteration loop (5 cycles)
    best_report = None
    best_score = 0
    for i in range(5):
        report = generate_report(prompt)
        eval_result = eval_binary_criteria(report, EVAL_CRITERIA)
        if eval_result['score'] > best_score:
            best_report = report
            best_score = eval_result['score']
        if eval_result['score'] >= 5:  # 5 of 6 criteria = 83%+ pass
            break
        # Rewrite the prompt based on the failures
        prompt = fix_prompt(prompt, eval_result['failures'])
    return best_report
```

**Phase 3: Slow feedback tracking (background job)**
```python
# New cron job: feedback-tracker (daily at 04:00)
def track_morning_report_feedback():
    """Runs daily after the morning report (03:00)."""
    # 1. Check the email open time (Gmail API)
    open_time = get_email_open_time(latest_morning_report_id)

    # 2. Track reply engagement (Discord API)
    reply = get_discord_reply(channel='#echo', after=morning_report_time)

    # 3. Analyze patterns
    if open_time < 3600:  # <1h
        score_positive('fast_open')
    if reply and 'section X' in reply:
        score_positive('section_X_engagement')

    # 4. Update the feedback JSON
    update_feedback_memory('morning-report-rules.json', insights)
```

**Estimated effort:**
- Setup: 4-6h (eval criteria, fast iteration loop, feedback tracking)
- Maintenance: 0h (automatic after setup)
- Benefit: more relevant reports, fewer follow-up questions

---

#### Plan B: YouTube Processing Quality Loop

**Phase 1: Eval criteria**
```python
YOUTUBE_EVAL_CRITERIA = [
    {'name': 'tldr_under_150',    'check': lambda md: len(extract_tldr(md).split()) < 150},
    {'name': 'five_plus_points',  'check': lambda md: md.count('###') >= 5},
    {'name': 'three_plus_quotes', 'check': lambda md: md.count('> ') >= 3},
    {'name': 'insights_marked',   'check': lambda md: bool(re.search(r'[✅🔴]', md))},
    {'name': 'tags_present',      'check': lambda md: bool(re.search(r'@(work|health|growth)', md))},
    {'name': 'link_preview',      'check': lambda md: 'files.html#memory/kb/' in md}
]
```

**Phase 2: Fast iterations in youtube_subs.py**
```python
def process_with_autoresearch(transcript, title):
    feedback = load_feedback('memory/feedback/youtube-rules.json')
    prompt = enhance_prompt(BASE_YOUTUBE_PROMPT, feedback)
    for i in range(3):
        summary_md = generate_summary(prompt, transcript, title)
        eval_result = eval_binary_criteria(summary_md, YOUTUBE_EVAL_CRITERIA)
        if eval_result['score'] >= 5:
            break
        prompt = fix_prompt(prompt, eval_result['failures'])
    return summary_md
```

**Phase 3: Slow feedback (manual + automated)**
```python
# Tracked in memory/approved-tasks.md or memory/YYYY-MM-DD.md
# When Marius marks an insight as [x] executed:
def track_insight_execution(insight_text, video_id):
    feedback_db.record_positive('insight_execution', {
        'video_id': video_id,
        'insight': insight_text,
        'domain': extract_domain(insight_text)  # @work, @health, etc.
    })

# Monthly review (or on demand):
def analyze_youtube_patterns():
    # Which domains have the highest [x] rate?
    # Which insight types get ignored?
    # Which TL;DR length gets the best engagement?
    # Update youtube-rules.json
```

**Estimated effort:**
- Setup: 3-4h
- Maintenance: 1h/month (manual pattern review)
- Benefit: more actionable insights, less noise

---

#### Plan C: Ralph PRD Quality Loop

**Phase 1: PRD eval criteria**
```python
RALPH_PRD_CRITERIA = [
    {'name': 'use_cases_defined',       'check': lambda prd: '## Use Cases' in prd and prd.count('- ') >= 3},
    {'name': 'success_metrics',         'check': lambda prd: bool(re.search(r'(KPI|metric|measure)', prd, re.I))},
    {'name': 'tech_stack_specified',    'check': lambda prd: '## Tech Stack' in prd},
    {'name': 'stories_have_acceptance', 'check': lambda prd: prd.count('Acceptance Criteria:') >= 3},
    {'name': 'dependencies_identified', 'check': lambda prd: '## Dependencies' in prd},
    {'name': 'testing_strategy',        'check': lambda prd: bool(re.search(r'test', prd, re.I))}
]
```

**Phase 2: Fast iterations (Opus + Sonnet collaboration)**
```python
# in tools/ralph_prd_generator.py
def create_prd_with_autoresearch(project_name, description):
    feedback = load_feedback('memory/feedback/ralph-prd-rules.json')
    for i in range(3):
        # Opus: generate the PRD
        prd_md = opus_generate_prd(project_name, description, feedback)
        # Sonnet: evaluate against the criteria
        eval_result = sonnet_eval_prd(prd_md, RALPH_PRD_CRITERIA)
        if eval_result['score'] >= 5:
            break
        # Opus: rewrite the brief based on the failures
        description = opus_enhance_brief(description, eval_result['failures'])
    # Generate prd.json
    prd_json = opus_prd_to_json(prd_md)
    return prd_md, prd_json
```

**Phase 3: Slow feedback (post-implementation tracking)**
```python
# New file: memory/feedback/ralph-tracking.json
{
  "projects": [
    {
"name": "roa-report-new", "prd_score": 6/6, "implementation": { "stories_completed_no_changes": 8, "stories_rewritten": 2, "bugs_post_deploy": 1, "missed_dependencies": 0 }, "quality_score": 0.87 # Derived metric } ] } # Lunar/per-project review: def analyze_ralph_quality(): # PRD score 6/6 → quality_score high? Correlation? # Ce criteria au highest correlation cu success? # Update ralph-prd-rules.json ``` **Estimat efort:** - Setup: 5-7h (Opus+Sonnet collaboration complex) - Maintenance: 1h/proiect (manual review post-deploy) - Benefit: PRD-uri mai robuste, mai puține rewrites în implementation --- ### 🔴 Limitări și Atenționări #### 1. Overfitting la Date Istorice **Problema:** - Optimizarea pentru "what worked în trecut" poate rata "what works NOW" - Context change: audience, trends, Marius preferences evolve **YouTube case:** - Thumbnails de 3 ani în urmă: 14% CTR - Optimizing pentru acele patterns poate fi outdated **Soluție pentru Echo:** - **Periodic baseline reset:** 1x/lună, ignore oldest 20% data - **A/B test new approaches:** Don't only optimize current rules, try variations - **Track rule age:** Decay confidence score over time (rule din 2025 = lower confidence în 2026) **Implementation:** ```python def decay_rule_confidence(rule, current_date): age_months = (current_date - rule['created']).months decay_factor = 0.95 ** age_months # 5% decay/lună return rule['confidence'] * decay_factor ``` --- #### 2. False Positives în Eval Criteria **Problema:** - High eval score ≠ high real-world performance - Eval criteria pot fi superficiale (checks form, not substance) **YouTube case:** - Thumbnail scored 11/12 dar got 3.4% CTR - Binary criteria passed, dar real audience nu a dat click **Soluție pentru Echo:** - **MUST correlate eval score cu real outcomes** - Track: eval_score vs reply_rate, open_time, engagement - Identify false positives: high eval, low outcome - Refine criteria: "What did eval miss?" 
**Implementation:** ```python def detect_false_positives(threshold_eval=0.8, threshold_outcome=0.5): """Find reports cu high eval score dar low real engagement""" false_positives = [] for report in reports_db: if report['eval_score'] > threshold_eval and report['outcome_score'] < threshold_outcome: false_positives.append(report) # Analyze: ce criteria au trecut dar nu ar fi trebuit? return false_positives ``` --- #### 3. Slow Feedback Loop Latency **Problema:** - YouTube API = 2-3 zile delay pentru CTR data - Slow to adapt la real-time changes **Pentru Echo:** - **Email feedback:** Gmail API = same day (mai rapid) - **Discord replies:** Instant (dacă Marius răspunde) - **BUT:** Reply patterns = variabile (mood, busy-ness, etc.) **Soluție:** - **Combine fast + slow signals:** - Fast: Email open time (hours) - Slow: Reply engagement patterns (days) - Very slow: Monthly satisfaction review - **Weight fast signals lower** (more noise), slow signals higher (more signal) --- #### 4. Human-in-the-Loop Bias **Problema:** - Dacă Marius dă feedback bazat pe vibes (nu data), loop se degradează - "Mi-a plăcut raportul ăsta" ≠ "Raportul ăsta m-a ajutat să iau decizie" **Soluție:** - **Prioritize objective metrics** > human feedback - **Ask specific questions:** "Ce secțiune a fost cea mai utilă?" (nu "Ți-a plăcut?") - **Track behavior, not opinions:** Open time, reply time, action taken (mai reliable decât "rating 1-10") **Implementation:** ```python feedback_weights = { 'objective_metric': 0.5, # CTR, reply time, open rate 'controlled_test': 0.3, # A/B splits 'eval_criteria': 0.15, # Binary checks 'human_feedback': 0.05 # Lowest weight (most biased) } ``` --- ### 📊 Metrici de Success pentru Echo Dacă implementăm autoresearch loop pentru rapoarte/insights/emails: #### Baseline (Current - Unknown) **Morning Reports:** - Generation time: ~5min (estimate) - Marius reply rate: ?% (not tracked) - Open time: ?h (not tracked) - Sections clicked: ? 
(not tracked) **YouTube Processing:** - Generation time: ~3min (estimate) - Insights execution rate: ?% [x] vs [ ] (not systematically tracked) - Follow-up tasks: ? (not tracked) **Email Communication:** - Draft time: ~2min (estimate) - Reply time: ?h average (not tracked) - Action items completed: ?% (not tracked) --- #### Target (Cu Autoresearch - 3 Months) **Morning Reports:** - Generation time: <3min (fast iterations reduce back-and-forth) - Marius reply rate: >70% (mai relevant content) - Open time: <1h for 80% of reports (better subject lines) - Sections clicked: Track + optimize (feedback JSON) **YouTube Processing:** - Generation time: <2min (optimized prompts) - Insights execution rate: >50% [x] (mai actionable) - Follow-up tasks: 30%+ of relevant videos (better filtering) **Email Communication:** - Draft time: <1min (learned patterns) - Reply time: <12h average (clearer action items) - Action items completed: >80% (better framing) --- #### Tracking Implementation **Nou: `memory/feedback/analytics.db` (SQLite)** ```sql CREATE TABLE events ( id INTEGER PRIMARY KEY, domain TEXT, -- 'morning_report', 'youtube', 'email' event_type TEXT, -- 'open', 'reply', 'execute_insight', 'click' metadata JSON, -- {report_id, section, timestamp, etc.} timestamp INTEGER ); CREATE TABLE feedback_rules ( id INTEGER PRIMARY KEY, domain TEXT, rule TEXT, confidence REAL, source TEXT, -- 'api', 'split_test', 'human', 'eval' rationale TEXT, created INTEGER, last_updated INTEGER ); ``` **Dashboard tracking:** ```python # Extend dashboard/index.html cu Analytics tab # Show: # - Eval score trends over time (improving?) 
# - Outcome metrics (reply rate, open time, execution rate) # - Correlation: eval vs outcome (detect false positives) # - Top rules by confidence # - Recent feedback events ``` --- ## 🔗 Link-uri & Resurse - **Video:** https://youtu.be/0PO6m09_80Q - **Karpathy Autoresearch:** https://github.com/karpathy/autoresearch (referenced) - **YouTube Reporting API:** https://developers.google.com/youtube/reporting - **YouTube Analytics API:** https://developers.google.com/youtube/analytics - **Gemini Vision:** Used for thumbnail scoring **Cohort mentioned:** - Live build session: March 23rd (Monday & Thursday) - Free community: ~1,000 members, "AI agent classroom" - Python file: 1,000 lines (shared în community) --- ## 📝 Note Suplimentare ### Gap Performance Original - **Old thumbnails (3 ani):** 14-18% CTR (best performers) - **Recent thumbnails:** 3.4-9% CTR - **Gap:** 10+ percentage points → motivație pentru autoresearch ### ABC Split Test Winner - **A (abstract/text-heavy):** 51% preference - **B (mid):** 28% - **C (author face):** 21% (lowest - "That hurts") ### Implementation Details - **Airtable:** Used pentru storing video data (500+ videos) - **Gemini Vision:** Scoring thumbnails vs criteria - **1,000 lines Python:** Entire autoresearch system - **Fast iterations:** 10 cycles, 3 thumbnails each = 30 total generated - **Final winner:** 11/12 score (doar 1 criterion failed) ### Author's Other Systems - **AI clone for social media:** Instagram/Facebook reels (35k views, automated) - **Thumbnail skill:** Existing skill în OpenClaw/Claude Code pentru quick generation --- **Status:** [ ] Discută cu Marius: Implementăm autoresearch pentru Echo rapoarte? **Priority:** High - pattern universal, beneficiu mare pe termen lung **Estimat efort:** 10-15h setup initial (toate 3 domenii), apoi automat **ROI:** Compounding improvements - fiecare raport/insight mai bun decât ultimul
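---

As a closing sketch, the fast-iteration loop this note keeps coming back to (generate → binary eval → rewrite the prompt → keep the best candidate) condenses into a few domain-agnostic lines. `generate` and `fix_prompt` below are toy stand-ins, not the author's 1,000-line system:

```python
def score_against_criteria(content, criteria):
    """Binary scoring: criteria is a list of (name, check) pairs."""
    failures = [name for name, check in criteria if not check(content)]
    return len(criteria) - len(failures), failures

def fast_iteration_loop(generate, fix_prompt, prompt, criteria, cycles=5):
    """Generate → score → rewrite the prompt, keeping the best candidate."""
    best, best_score = None, -1
    for _ in range(cycles):
        candidate = generate(prompt)
        score, failures = score_against_criteria(candidate, criteria)
        if score > best_score:
            best, best_score = candidate, score
        if not failures:  # all criteria pass → stop early
            break
        prompt = fix_prompt(prompt, failures)
    return best, best_score

# Toy usage: "content" is just a string and the "generator" echoes the prompt,
# so improving the prompt directly improves the output
criteria = [
    ('has_done', lambda s: 'DONE' in s),
    ('short',    lambda s: len(s.split()) < 20),
]
generate = lambda p: p
fix_prompt = lambda p, failures: p + ' DONE' if 'has_done' in failures else p
best, score = fast_iteration_loop(generate, fix_prompt, 'summary text', criteria)
# best == 'summary text DONE', score == 2 (both criteria pass)
```

The same skeleton works for reports, summaries, or PRDs: only the `criteria` list and the real `generate`/`fix_prompt` implementations change per domain.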