Claude Code + Karpathy's Autoresearch = INSANE RESULTS!
URL: https://youtu.be/0PO6m09_80Q
Duration: 12:44
Saved: 2026-03-21
Tags: @work @scout #autoresearch #self-improving #automation #machine-learning
📋 TL;DR
The author builds a self-improving system for YouTube thumbnails inspired by Andrej Karpathy's autoresearch loop. The system pulls real data (500+ videos, CTR from the YouTube API), creates binary eval criteria (12 yes/no questions about thumbnail quality), iterates fast (10 cycles × 3 thumbnails), improves its own prompts automatically, then runs daily with 4 feedback sources: the YouTube Reporting API (real post-publish CTR), ABC split tests (the highest-confidence signal), human feedback during iterations, and fast iterations (offline scoring). Result: eval score grew from 8.7/12 to 11/12 over 10 iterations without human intervention. Performance gap: old thumbnails ~14% CTR vs new ones ~3.4% CTR → the system learns from what worked before.
🎯 Key Points
1. Data-Driven Eval Criteria (Not Vibes)
Process:
- Scraped 180+ videos from the past 3 years
- Grouped into 3 categories: winners (high CTR), losers (low CTR), mid
- Statistical analysis of titles and thumbnails
Data-backed patterns:
- "How to" în titlu: 50% winners vs 23% losers
- "Tutorial": 44% winners vs 13% losers
- Negative framing (stop, forget, RIP): only 6% among winners
- Exclamation marks: loser criteria
- Questions în titlu: loser criteria
Conclusion: criteria grounded in real CTR, not in "this looks good to me"
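The winner-vs-loser rates above reduce to simple frequency counts; a minimal sketch with made-up example titles (the real analysis ran over the 180+ scraped videos):

```python
def pattern_rate(titles, pattern):
    """Fraction of titles containing the pattern (case-insensitive)."""
    if not titles:
        return 0.0
    return sum(pattern.lower() in t.lower() for t in titles) / len(titles)

# Illustrative titles only, not the author's dataset
winners = ["How to Build an AI Agent", "Claude Code Tutorial", "How to Automate Research"]
losers = ["Stop Using ChatGPT!", "Is Prompt Engineering Dead?", "RIP LangChain"]

for p in ["how to", "tutorial", "stop"]:
    print(f"{p!r}: winners {pattern_rate(winners, p):.0%} vs losers {pattern_rate(losers, p):.0%}")
```

Any pattern whose winner rate clearly exceeds its loser rate becomes a candidate eval criterion.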
2. 12 Binary Eval Questions
Format: Yes/No (not a 1-10 scale), eliminates ambiguity
Visual Anchor & Attention:
1. Single dominant visual anchor (face/graphic) taking 20%+ of frame?
2. Anchor conveys emotion/energy/intrigue?
3. Directional cues present (arrows, pointing)?
Text & Readability:
4. Text limited to 1-4 bold, high-contrast words?
5. Text readable at mobile size?
Composition:
6. Background simple and uncluttered?
7. Clear visual hierarchy?
8. Shows result/output/transformation (not just tool/process)?
Branding:
9. One or more recognizable logos present?
Packaging (for the title):
10-12. Similar criteria for the title (how-to, tutorial, avoid negative framing)
Why binary: Consistent scoring, automatable, reproducible
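The questions can live as plain data with the judge abstracted behind a callable (the video uses Gemini Vision as the judge; here it is a stub for illustration):

```python
# Three of the twelve questions, as data; the judge is any callable
# answering True/False per (thumbnail, question) pair.
QUESTIONS = [
    "Single dominant visual anchor taking 20%+ of frame?",
    "Anchor conveys emotion/energy/intrigue?",
    "Text limited to 1-4 bold, high-contrast words?",
]

def score_binary(thumbnail, questions, judge):
    answers = {q: bool(judge(thumbnail, q)) for q in questions}
    failures = [q for q, ok in answers.items() if not ok]
    return len(questions) - len(failures), failures

# Stub judge for demonstration: fails only the text question
score, failures = score_binary("thumb.png", QUESTIONS, lambda t, q: "Text" not in q)
print(score, failures)  # 2 ['Text limited to 1-4 bold, high-contrast words?']
```

Because each answer is a bare yes/no, the same rubric can be re-run on any thumbnail and the failed questions feed directly into the prompt-rewrite step.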
3. Fast Iteration Loop (Offline)
Flow:
- Generate 3 thumbnails
- Score each against the 12 criteria (Gemini Vision)
- Identify failures (criteria = no)
- Rewrite the generation prompt to fix the failures
- Repeat
Results (10 iterations):
- Start: 8.7/12 average score
- End: 11/12 single best thumbnail
- No human feedback
Examples of prompt improvements:
- Iteration 1: "Add emotional intrigue"
- Iteration 3: "Make text much bigger and bolder"
- Iteration 5: "Simplify background, remove clutter"
- Iteration 8: "Increase visual hierarchy with directional cues"
Benefit: a better baseline BEFORE publishing
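The loop above can be sketched generically, with the model calls stubbed out (`generate`, `evaluate`, and `rewrite_prompt` are placeholders for thumbnail generation, Gemini Vision scoring, and the prompt-rewrite step):

```python
def fast_iterate(prompt, generate, evaluate, rewrite_prompt,
                 cycles=10, per_cycle=3, target=12):
    """Offline loop: generate candidates, score them, rewrite the prompt
    from the failed criteria, repeat until the target score or cycle cap."""
    best, best_score = None, -1
    for _ in range(cycles):
        failures = []
        for cand in (generate(prompt) for _ in range(per_cycle)):
            score, failures = evaluate(cand)  # (score, list of failed criteria)
            if score > best_score:
                best, best_score = cand, score
        if best_score >= target:
            break
        prompt = rewrite_prompt(prompt, failures)  # fix the last candidate's misses
    return best, best_score
```

With 10 cycles × 3 candidates this matches the 30 thumbnails generated in the video; the same skeleton works for reports or summaries by swapping the three callables.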
4. Daily Slow Loop (Online Feedback)
Full flow:
- Create thumbnail: Using thumbnail skill + feedback memory rules
- Publish video
- Wait 2-3 days: YouTube Reporting API data available
- Pull CTR data: Real click-through rate
- Score thumbnail: Against 12 criteria
- Correlate: High eval score + low CTR? = False positive
- Update feedback memory JSON: New data-backed rules
- Next thumbnail starts from better baseline
Example correlation:
- Thumbnail scored 11/12 but got 3.4% CTR → False positive
- Identify which criteria failed in practice
- Update rules: "Circular logos = avoid" or "Too much background detail = reduce"
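The correlation step reduces to a small classifier; a sketch with illustrative thresholds (10/12 eval, 5% CTR — the actual cutoffs are a tuning choice):

```python
def correlate(eval_score, ctr, eval_threshold=10, ctr_threshold=0.05):
    """Compare the offline eval score with real post-publish CTR."""
    if eval_score >= eval_threshold and ctr < ctr_threshold:
        return "false_positive"   # criteria passed, audience didn't click
    if eval_score < eval_threshold and ctr >= ctr_threshold:
        return "false_negative"   # criteria failed, audience clicked anyway
    return "consistent"

print(correlate(11, 0.034))  # the 11/12-but-3.4%-CTR case above → false_positive
```

False positives are the interesting output: each one means a criterion checks form but not substance, and the rule set should be updated.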
5. Four Feedback Sources
1. YouTube Reporting API (slow but accurate)
- Real CTR post-publish
- 2-3 days latency
- Objective performance data
2. ABC Split Tests (highest confidence)
- Same video, same audience, different packaging
- YouTube picks winner automatically
- Controlled experiment = most reliable signal
- Extract winner/loser criteria → feed to memory JSON
3. Human Feedback (during creation)
- The author gives feedback on iterations: "I like this, don't like that"
- Subjective but fast
- Helps refine taste preferences
4. Fast Iterations (offline scoring)
- Eval before publish
- Catches obvious failures
- Improves baseline
Prioritization: ABC splits > YouTube API > Fast iterations > Human feedback
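One way to make that prioritization executable: when two remembered rules about the same subject conflict, keep the one backed by the higher-priority source. A hypothetical sketch (field names like `subject` are illustrative, not from the video):

```python
# Highest-confidence source first, mirroring:
# ABC splits > YouTube API > fast iterations > human feedback.
PRIORITY = ["split_test", "youtube_api", "fast_iterations", "human"]

def resolve_conflicts(rules):
    best = {}
    for r in rules:
        rank = (-PRIORITY.index(r["source"]), r["confidence"])
        cur = best.get(r["subject"])
        if cur is None or rank > (-PRIORITY.index(cur["source"]), cur["confidence"]):
            best[r["subject"]] = r
    return list(best.values())

rules = [
    {"subject": "logos", "rule": "avoid circular logos", "source": "split_test", "confidence": 0.72},
    {"subject": "logos", "rule": "big logos help", "source": "human", "confidence": 0.90},
]
kept = resolve_conflicts(rules)
print(kept[0]["rule"])  # avoid circular logos
```

Note the split-test rule wins despite its lower confidence score, because source quality outranks stated confidence.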
6. Self-Rewriting Prompts
Mechanism:
- Centralized feedback_memory.json containing data-backed rules (not vibes)
- Auto-injected into generation prompts
Example feedback memory:
```json
{
  "rules": [
    {"rule": "Use 'How to' in title", "confidence": 0.85, "source": "API"},
    {"rule": "Avoid circular logos", "confidence": 0.72, "source": "split_test"},
    {"rule": "Text size minimum 48px", "confidence": 0.91, "source": "iterations"}
  ],
  "winners": [...],
  "losers": [...]
}
```
Every new thumbnail:
- Loads feedback memory
- Starts from better baseline
- Incorporates all previous learnings
Result: Compounding improvements over time
💬 Relevant Quotes
"It's never been clearer to me that we need to create these automated loops that improve itself every single time we do them."
"You can't make up the eval criteria based on vibes. It has to be a yes/no answer."
"The split test signal is the highest confidence signal because it is a controlled experiment. Same video, same audience but different packaging."
"Every new thumbnail starts from a better baseline than the last."
"The numbers are clear. The winners were using 'how to' in the titles 50% of the time, losers 23%."
"It added specific features like make the text much bigger and bolder. It fixed the text again. It went from giving an average of 8.7 to a single 11 out of 12 in 10 iterations without giving me a single feedback."
"That video got 29,000 views. But something interesting happened when I was checking the backend stats... the impression click-through rate of this video was 8%. But I have been making videos for 3 years in the AI space and some of my older videos are hitting 14%."
💡 Insights & Ideas
✅ Universal Pattern - Applicable to Echo/Marius
1. Autoresearch Loop = Eval Criteria Binare + Fast Iterations + Feedback Memory
Core concept:
- A system that rewrites its own prompts based on real data
- Not specific to thumbnails - a universal pattern
Components:
- Binary eval criteria (yes/no, not scales)
- Fast iterations (offline, before deploy)
- Slow feedback (online, post-deploy)
- Feedback memory (centralized rules, auto-injected)
Applicability to Echo:
A. Morning/Evening Reports
- Eval criteria: DONE items included? Calendar <48h? Insights with quotes? Length <500 words?
- Fast iterations: Generate 3 variants → Score → Improve → Repeat × 5
- Slow feedback: Track email open time, reply engagement, ignored sections
- Memory: memory/feedback/report-rules.json
B. YouTube Processing
- Eval criteria: TL;DR <150 words? 5+ key points? 3+ quotes? Domain tags?
- Fast iterations: Process transcript → 3 summary variants → Score → Improve
- Slow feedback: Which insights get [x] executed vs [ ] ignored? Which domains get engagement?
- Memory: memory/feedback/youtube-rules.json
C. Coaching Messages (08:00 & 23:00)
- Eval criteria: Open question? Under 100 words? Empathic tone? Tied to the avatar?
- Fast iterations: 3 message variants → Score tone/relevance → Improve
- Slow feedback: Reply rate? Depth of Marius's responses? Engagement patterns?
- Memory: memory/feedback/coaching-rules.json
D. Calendar Alerts
- Eval criteria: Alert <2h in advance? Location included? Context included? Clear action?
- Fast iterations: N/A (simple alert)
- Slow feedback: Snooze vs confirm rate? Which events get a quick reply?
- Memory: memory/feedback/calendar-rules.json
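The four domains above share one shape, so a single loop could drive all of them from a uniform config. A hypothetical sketch (criterion names are illustrative shorthand, not existing code):

```python
# One config entry per Echo domain: where its feedback memory lives,
# how many offline iterations to run, and which binary criteria apply.
DOMAINS = {
    "morning_reports": {
        "memory": "memory/feedback/report-rules.json",
        "fast_iterations": 5,
        "criteria": ["done_items", "calendar_48h", "insights_with_quotes", "under_500_words"],
    },
    "youtube": {
        "memory": "memory/feedback/youtube-rules.json",
        "fast_iterations": 3,
        "criteria": ["tldr_under_150", "five_key_points", "three_quotes", "domain_tags"],
    },
    "coaching": {
        "memory": "memory/feedback/coaching-rules.json",
        "fast_iterations": 3,
        "criteria": ["open_question", "under_100_words", "empathic_tone", "avatar_link"],
    },
    "calendar_alerts": {
        "memory": "memory/feedback/calendar-rules.json",
        "fast_iterations": 0,  # simple alerts skip the offline loop
        "criteria": ["two_hour_lead", "location", "context", "clear_action"],
    },
}

def loop_settings(domain):
    cfg = DOMAINS[domain]
    return cfg["memory"], cfg["fast_iterations"], cfg["criteria"]
```

Adding a new domain then means adding a dict entry, not writing a new loop.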
2. Binary Eval Criteria >> Subjective Scoring
Why yes/no beats a 1-10 scale:
- Eliminates ambiguity: "Has 3+ quotes?" = clear; "Insight quality 1-10?" = subjective
- Easy to automate: regex, simple checks, no ML needed
- Reproducible: same input → same score (not mood-dependent)
- Actionable: "No" = you know exactly what to fix; "Score 6/10" = what does that mean?
For Echo:
- ✅ "Link preview included?" vs ❌ "How useful is the link, 1-10?"
- ✅ "Marius replied in <24h?" vs ❌ "How urgent did it seem, 1-10?"
- ✅ "Uncommitted git files?" vs ❌ "How important is the commit, 1-10?"
A simple implementation:
```python
def eval_binary_criteria(content, criteria_list):
    score = 0
    failures = []
    for criterion in criteria_list:
        if criterion['check'](content):
            score += 1
        else:
            failures.append(criterion['name'])
    return {'score': score, 'total': len(criteria_list), 'failures': failures}
```
3. Fast Iterations (Offline) vs Slow Feedback (Online)
Fast iterations (before deploy):
- Purpose: improve the baseline without waiting for real-world data
- Speed: seconds to minutes
- Feedback: eval criteria (binary checks)
- Benefit: start from a better baseline
Slow feedback (post-deploy):
- Purpose: validate assumptions, correlate eval scores with real outcomes
- Speed: hours to days
- Feedback: real user behavior (CTR, reply rate, engagement)
- Benefit: detect false positives, refine rules
For the Ralph workflow:
- Fast: PRD generation → Self-review stories → Opus rewrites stories → Iterate (before Claude Code implementation)
- Slow: Deploy → Track bugs, missed dependencies, story rewrites → Feed back into PRD templates
Combined benefit:
- Fast = fewer bad deploys
- Slow = continuous refinement based on reality
4. Multiple Feedback Sources = Higher Confidence
YouTube case (4 sources):
- YouTube API (CTR real) - objective, slow
- ABC split tests - highest confidence (controlled experiment)
- Human feedback - subjective, fast
- Fast iterations - eval-based, instant
Prioritization: Controlled experiments > Objective metrics > Eval criteria > Human vibes
For Echo:
Morning Reports:
- Email open tracking (objective, medium speed) - "Open rate <1h?"
- Reply engagement (objective, fast) - "Reply to which sections?"
- A/B test formats (highest confidence) - "Weekly variation, track response"
- Self-eval (instant) - "Binary criteria passed?"
YouTube Processing:
- Insights execution rate (objective, slow) - "[x] vs [ ] ratio"
- Follow-up tasks (objective, medium) - "Video generates task?"
- Domain relevance (subjective, fast) - "Marius interest level?"
- Self-eval (instant) - "TL;DR length, quotes count, tags present?"
Implementation:
```python
feedback_sources = [
    {'name': 'objective_metric', 'weight': 0.4},  # CTR, reply rate, etc.
    {'name': 'controlled_test', 'weight': 0.3},   # A/B splits
    {'name': 'eval_criteria', 'weight': 0.2},     # binary checks
    {'name': 'human_feedback', 'weight': 0.1}     # subjective
]

def aggregate_feedback(sources_data):
    # sources_data is aligned with feedback_sources, one score per source
    return sum(data['score'] * src['weight']
               for src, data in zip(feedback_sources, sources_data))
```
5. Self-Rewriting Prompts via Feedback JSON
Pattern:
- Centralized feedback memory (feedback_memory.json) containing data-backed rules (confidence score, source)
- Auto-injected into generation prompts
- Every iteration starts from a better baseline
Example structure:
```json
{
  "domain": "morning_reports",
  "last_updated": "2026-03-21",
  "rules": [
    {
      "rule": "Include DONE items in the first 3 paragraphs",
      "confidence": 0.89,
      "source": "email_tracking",
      "rationale": "Open rate +42% when DONE is at the top"
    },
    {
      "rule": "Calendar alerts <48h must be bold",
      "confidence": 0.76,
      "source": "reply_engagement",
      "rationale": "Confirm rate +28% when bold"
    },
    {
      "rule": "Skip the git status section when there are no uncommitted files",
      "confidence": 0.94,
      "source": "controlled_test",
      "rationale": "Reply time -15min when empty sections are skipped"
    }
  ],
  "anti_patterns": [
    {
      "pattern": "Bullet lists >10 items",
      "confidence": 0.81,
      "rationale": "Ignored rate +35%"
    }
  ]
}
```
Auto-injection into the prompt:
```python
import json

def enhance_prompt_with_feedback(base_prompt, feedback_json_path):
    with open(feedback_json_path) as f:
        feedback = json.load(f)
    # Keep only high-confidence rules (>0.7)
    rules = [r for r in feedback['rules'] if r['confidence'] > 0.7]
    rules_text = "\n".join(f"- {r['rule']} (confidence: {r['confidence']:.0%})"
                           for r in rules)
    anti_text = "\n".join(f"- {ap['pattern']}" for ap in feedback['anti_patterns'])
    return f"""{base_prompt}

DATA-BACKED RULES (apply these strictly):
{rules_text}

ANTI-PATTERNS (avoid these):
{anti_text}
"""
```
Benefit: compounding improvements - every report/insight/email is better than the last
6. Data >> Vibes
YouTube case:
- Gap: 14% CTR (old thumbnails) vs 3.4% CTR (new) ≈ 10 percentage points
- Objective, measurable, impossible to ignore
For Marius:
A. New clients (entrepreneurship)
- Vibe: "I don't know if it will work"
- Data: Track pitch proposals → response rate → conversion rate
- Insight: "Email pitch with case study = 43% reply vs 12% without"
B. ROA support tickets
- Vibe: "This client is difficult"
- Data: Track ticket resolution time, follow-up questions, satisfaction
- Insight: "Video tutorial = 2.1 follow-ups vs 4.7 with a text explanation"
C. ROA features
- Vibe: "Feature X is important"
- Data: Track feature usage post-deploy (analytics)
- Insight: "New reports = 78% monthly active users, PDF export = 12%"
D. Echo reports
- Vibe: "This report is useful"
- Data: Track open rate, reply time, sections clicked
- Insight: "Morning report opened <1h = 64%, evening report = 31%"
Tracking implementation:
```python
# In tools/analytics_tracker.py
import json
import sqlite3
import time

class FeedbackTracker:
    def __init__(self, db_path='memory/feedback/analytics.db'):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id INTEGER PRIMARY KEY,
                domain TEXT, type TEXT, metadata TEXT, timestamp INTEGER
            )""")

    def track_event(self, domain, event_type, metadata):
        """Track any feedback event."""
        self.db.execute("""
            INSERT INTO events (domain, type, metadata, timestamp)
            VALUES (?, ?, ?, ?)
        """, (domain, event_type, json.dumps(metadata), time.time()))
        self.db.commit()

    def get_insights(self, domain, window_days=30):
        """Extract data-backed insights (sketch)."""
        # Query events in the window,
        # calculate rates, patterns, correlations,
        # return ranked insights with confidence scores.
        ...
```
🛠️ Practical Implementation for Echo
Plan A: Self-Improving Morning Reports
Phase 1: Set Up Eval Criteria (1 day)
```python
# In tools/morning_report_autoresearch.py
import re

EVAL_CRITERIA = [
    {
        'name': 'done_items_present',
        'check': lambda report: bool(re.search(r'✅.*DONE', report)),
        'weight': 0.15
    },
    {
        'name': 'calendar_alerts_48h',
        'check': lambda report: bool(re.search(r'📅.*<48h', report)),
        'weight': 0.20
    },
    {
        'name': 'length_under_500',
        'check': lambda report: len(report.split()) < 500,
        'weight': 0.10
    },
    {
        'name': 'insights_with_quotes',
        'check': lambda report: report.count('"') >= 2,
        'weight': 0.15
    },
    {
        'name': 'git_status_if_needed',
        'check': lambda report: ('uncommitted' in report.lower()) or ('git status: clean' in report.lower()),
        'weight': 0.10
    },
    {
        'name': 'link_preview_offered',
        'check': lambda report: 'moltbot.tailf7372d.ts.net/echo/' in report,
        'weight': 0.10
    }
]
```
Phase 2: Fast Iterations (integrated into daily-morning-checks)
```python
def generate_report_with_autoresearch():
    # Load feedback memory
    feedback = load_feedback('memory/feedback/morning-report-rules.json')
    # Enhance base prompt
    prompt = enhance_prompt_with_feedback(BASE_REPORT_PROMPT, feedback)
    # Fast iteration loop (5 cycles)
    best_report = None
    best_score = 0
    for i in range(5):
        report = generate_report(prompt)
        eval_result = eval_binary_criteria(report, EVAL_CRITERIA)
        if eval_result['score'] > best_score:
            best_report = report
            best_score = eval_result['score']
        if eval_result['score'] >= 5:  # 83%+ pass
            break
        # Rewrite prompt based on failures
        prompt = fix_prompt(prompt, eval_result['failures'])
    return best_report
```
Phase 3: Slow Feedback Tracking (background job)
```python
# New cron job: feedback-tracker (daily 04:00)
def track_morning_report_feedback():
    """Runs daily after the morning report (03:00)."""
    # 1. Check email open time (Gmail API)
    open_time = get_email_open_time(latest_morning_report_id)
    # 2. Track reply engagement (Discord API)
    reply = get_discord_reply(channel='#echo', after=morning_report_time)
    # 3. Analyze patterns
    if open_time < 3600:  # <1h
        score_positive('fast_open')
    if reply and 'section X' in reply:
        score_positive('section_X_engagement')
    # 4. Update feedback JSON
    update_feedback_memory('morning-report-rules.json', insights)
```
Estimated effort:
- Setup: 4-6h (eval criteria, fast iteration loop, feedback tracking)
- Maintenance: 0h (automatic after setup)
- Benefit: more relevant reports, fewer follow-up questions
Plan B: YouTube Processing Quality Loop
Phase 1: Eval Criteria
```python
import re

YOUTUBE_EVAL_CRITERIA = [
    {'name': 'tldr_under_150', 'check': lambda md: len(extract_tldr(md).split()) < 150},
    {'name': 'five_plus_points', 'check': lambda md: md.count('###') >= 5},
    {'name': 'three_plus_quotes', 'check': lambda md: md.count('> ') >= 3},
    {'name': 'insights_marked', 'check': lambda md: bool(re.search(r'[✅🔴]', md))},
    {'name': 'tags_present', 'check': lambda md: bool(re.search(r'@(work|health|growth)', md))},
    {'name': 'link_preview', 'check': lambda md: 'files.html#memory/kb/' in md}
]
```
Phase 2: Fast Iterations in youtube_subs.py
```python
def process_with_autoresearch(transcript, title):
    feedback = load_feedback('memory/feedback/youtube-rules.json')
    prompt = enhance_prompt(BASE_YOUTUBE_PROMPT, feedback)
    for i in range(3):
        summary_md = generate_summary(prompt, transcript, title)
        eval_result = eval_binary_criteria(summary_md, YOUTUBE_EVAL_CRITERIA)
        if eval_result['score'] >= 5:
            break
        prompt = fix_prompt(prompt, eval_result['failures'])
    return summary_md
```
Phase 3: Slow Feedback (manual + automated)
```python
# Track in memory/approved-tasks.md or memory/YYYY-MM-DD.md
# When Marius marks an insight as [x] executed:
def track_insight_execution(insight_text, video_id):
    feedback_db.record_positive('insight_execution', {
        'video_id': video_id,
        'insight': insight_text,
        'domain': extract_domain(insight_text)  # @work, @health, etc.
    })

# Monthly review (or on demand):
def analyze_youtube_patterns():
    # Which domains have the highest [x] rate?
    # Which types of insights get ignored?
    # What TL;DR length gets the best engagement?
    # Update youtube-rules.json
    ...
```
Estimated effort:
- Setup: 3-4h
- Maintenance: 1h/month (manual pattern review)
- Benefit: more actionable insights, less noise
Plan C: Ralph PRD Quality Loop
Phase 1: PRD Eval Criteria
```python
import re

RALPH_PRD_CRITERIA = [
    {'name': 'use_cases_defined', 'check': lambda prd: '## Use Cases' in prd and prd.count('- ') >= 3},
    {'name': 'success_metrics', 'check': lambda prd: bool(re.search(r'(KPI|metric|measure)', prd, re.I))},
    {'name': 'tech_stack_specified', 'check': lambda prd: '## Tech Stack' in prd},
    {'name': 'stories_have_acceptance', 'check': lambda prd: prd.count('Acceptance Criteria:') >= 3},
    {'name': 'dependencies_identified', 'check': lambda prd: '## Dependencies' in prd},
    {'name': 'testing_strategy', 'check': lambda prd: bool(re.search(r'test', prd, re.I))}
]
```
Phase 2: Fast Iterations (Opus + Sonnet collaboration)
```python
# In tools/ralph_prd_generator.py
def create_prd_with_autoresearch(project_name, description):
    feedback = load_feedback('memory/feedback/ralph-prd-rules.json')
    for i in range(3):
        # Opus: generate PRD
        prd_md = opus_generate_prd(project_name, description, feedback)
        # Sonnet: evaluate against criteria
        eval_result = sonnet_eval_prd(prd_md, RALPH_PRD_CRITERIA)
        if eval_result['score'] >= 5:
            break
        # Opus: rewrite the brief based on failures
        description = opus_enhance_brief(description, eval_result['failures'])
    # Generate prd.json
    prd_json = opus_prd_to_json(prd_md)
    return prd_md, prd_json
```
Phase 3: Slow Feedback (post-implementation tracking)
New file: memory/feedback/ralph-tracking.json (quality_score is a derived metric):
```json
{
  "projects": [
    {
      "name": "roa-report-new",
      "prd_score": "6/6",
      "implementation": {
        "stories_completed_no_changes": 8,
        "stories_rewritten": 2,
        "bugs_post_deploy": 1,
        "missed_dependencies": 0
      },
      "quality_score": 0.87
    }
  ]
}
```
```python
# Monthly / per-project review:
def analyze_ralph_quality():
    # PRD score 6/6 → high quality_score? Any correlation?
    # Which criteria correlate most with success?
    # Update ralph-prd-rules.json
    ...
```
Estimated effort:
- Setup: 5-7h (the Opus+Sonnet collaboration is complex)
- Maintenance: 1h/project (manual post-deploy review)
- Benefit: more robust PRDs, fewer rewrites during implementation
🔴 Limitations and Caveats
1. Overfitting to Historical Data
Problem:
- Optimizing for "what worked in the past" can miss "what works NOW"
- Context changes: the audience, trends, and Marius's preferences evolve
YouTube case:
- Thumbnails from 3 years ago: 14% CTR
- Optimizing for those patterns may be outdated
Solution for Echo:
- Periodic baseline reset: once a month, ignore the oldest 20% of data
- A/B test new approaches: don't only optimize current rules, try variations
- Track rule age: decay confidence scores over time (a rule from 2025 = lower confidence in 2026)
Implementation:
```python
def decay_rule_confidence(rule, current_date):
    # timedelta has no .months attribute; approximate months from days
    age_months = (current_date - rule['created']).days / 30
    decay_factor = 0.95 ** age_months  # ~5% decay per month
    return rule['confidence'] * decay_factor
```
2. False Positives in Eval Criteria
Problem:
- High eval score ≠ high real-world performance
- Eval criteria can be superficial (they check form, not substance)
YouTube case:
- A thumbnail scored 11/12 but got 3.4% CTR
- The binary criteria passed, but the real audience didn't click
Solution for Echo:
- MUST correlate eval scores with real outcomes
- Track: eval_score vs reply_rate, open_time, engagement
- Identify false positives: high eval, low outcome
- Refine criteria: "What did the eval miss?"
Implementation:
```python
def detect_false_positives(threshold_eval=0.8, threshold_outcome=0.5):
    """Find reports with a high eval score but low real engagement."""
    false_positives = []
    for report in reports_db:
        if report['eval_score'] > threshold_eval and report['outcome_score'] < threshold_outcome:
            false_positives.append(report)
    # Analyze: which criteria passed but shouldn't have?
    return false_positives
```
3. Slow Feedback Loop Latency
Problem:
- YouTube API = 2-3 days of delay for CTR data
- Slow to adapt to real-time changes
For Echo:
- Email feedback: Gmail API = same day (faster)
- Discord replies: instant (if Marius replies)
- BUT: reply patterns vary (mood, busyness, etc.)
Solution:
- Combine fast + slow signals:
- Fast: email open time (hours)
- Slow: reply engagement patterns (days)
- Very slow: monthly satisfaction review
- Weight fast signals lower (more noise), slow signals higher (more signal)
4. Human-in-the-Loop Bias
Problem:
- If Marius gives feedback based on vibes (not data), the loop degrades
- "I liked this report" ≠ "This report helped me make a decision"
Solution:
- Prioritize objective metrics > human feedback
- Ask specific questions: "Which section was the most useful?" (not "Did you like it?")
- Track behavior, not opinions: open time, reply time, action taken (more reliable than a 1-10 rating)
Implementation:
```python
feedback_weights = {
    'objective_metric': 0.5,  # CTR, reply time, open rate
    'controlled_test': 0.3,   # A/B splits
    'eval_criteria': 0.15,    # binary checks
    'human_feedback': 0.05    # lowest weight (most biased)
}
```
📊 Success Metrics for Echo
If we implement the autoresearch loop for reports/insights/emails:
Baseline (Current - Unknown)
Morning Reports:
- Generation time: ~5min (estimate)
- Marius reply rate: ?% (not tracked)
- Open time: ?h (not tracked)
- Sections clicked: ? (not tracked)
YouTube Processing:
- Generation time: ~3min (estimate)
- Insights execution rate: ?% [x] vs [ ] (not systematically tracked)
- Follow-up tasks: ? (not tracked)
Email Communication:
- Draft time: ~2min (estimate)
- Reply time: ?h average (not tracked)
- Action items completed: ?% (not tracked)
Target (With Autoresearch - 3 Months)
Morning Reports:
- Generation time: <3min (fast iterations reduce back-and-forth)
- Marius reply rate: >70% (more relevant content)
- Open time: <1h for 80% of reports (better subject lines)
- Sections clicked: Track + optimize (feedback JSON)
YouTube Processing:
- Generation time: <2min (optimized prompts)
- Insights execution rate: >50% [x] (more actionable)
- Follow-up tasks: 30%+ of relevant videos (better filtering)
Email Communication:
- Draft time: <1min (learned patterns)
- Reply time: <12h average (clearer action items)
- Action items completed: >80% (better framing)
Tracking Implementation
New: memory/feedback/analytics.db (SQLite)
```sql
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    domain TEXT,       -- 'morning_report', 'youtube', 'email'
    event_type TEXT,   -- 'open', 'reply', 'execute_insight', 'click'
    metadata TEXT,     -- JSON: {report_id, section, timestamp, etc.}
    timestamp INTEGER
);

CREATE TABLE feedback_rules (
    id INTEGER PRIMARY KEY,
    domain TEXT,
    rule TEXT,
    confidence REAL,
    source TEXT,       -- 'api', 'split_test', 'human', 'eval'
    rationale TEXT,
    created INTEGER,
    last_updated INTEGER
);
```
Dashboard tracking:
# Extend dashboard/index.html with an Analytics tab
# Show:
# - Eval score trends over time (improving?)
# - Outcome metrics (reply rate, open time, execution rate)
# - Correlation: eval vs outcome (detect false positives)
# - Top rules by confidence
# - Recent feedback events
🔗 Links & Resources
- Video: https://youtu.be/0PO6m09_80Q
- Karpathy Autoresearch: https://github.com/karpathy/autoresearch (referenced)
- YouTube Reporting API: https://developers.google.com/youtube/reporting
- YouTube Analytics API: https://developers.google.com/youtube/analytics
- Gemini Vision: Used for thumbnail scoring
Cohort mentioned:
- Live build session: March 23rd (Monday & Thursday)
- Free community: ~1,000 members, "AI agent classroom"
- Python file: 1,000 lines (shared in the community)
📝 Additional Notes
Original Performance Gap
- Old thumbnails (3 years ago): 14-18% CTR (best performers)
- Recent thumbnails: 3.4-9% CTR
- Gap: 10+ percentage points → the motivation for autoresearch
ABC Split Test Winner
- A (abstract/text-heavy): 51% preference
- B (mid): 28%
- C (author face): 21% (lowest - "That hurts")
Implementation Details
- Airtable: used for storing video data (500+ videos)
- Gemini Vision: scores thumbnails against the criteria
- 1,000 lines of Python: the entire autoresearch system
- Fast iterations: 10 cycles, 3 thumbnails each = 30 generated in total
- Final winner: 11/12 score (only 1 criterion failed)
Author's Other Systems
- AI clone for social media: Instagram/Facebook reels (35k views, automated)
- Thumbnail skill: existing skill in OpenClaw/Claude Code for quick generation
Status: [ ] Discuss with Marius: do we implement autoresearch for Echo reports?
Priority: High - a universal pattern with a large long-term benefit
Estimated effort: 10-15h initial setup (all 3 domains), then automatic
ROI: compounding improvements - every report/insight better than the last