Update emory, memory (+22 ~1)
memory/kb/youtube/2026-03-21-autoresearch-thumbnails.md
# Claude Code + Karpathy's Autoresearch = INSANE RESULTS!

**URL:** https://youtu.be/0PO6m09_80Q
**Duration:** 12:44
**Saved on:** 2026-03-21
**Tags:** @work @scout #autoresearch #self-improving #automation #machine-learning

---

## 📋 TL;DR

The author builds a self-improving system for YouTube thumbnails inspired by Andrej Karpathy's autoresearch loop. The system pulls real data (500+ videos, CTR from the YouTube API), creates binary eval criteria (12 yes/no questions about thumbnail quality), iterates quickly (10 cycles × 3 thumbnails), and rewrites its own prompts automatically. It then runs daily with 4 feedback sources: the YouTube Reporting API (real post-publish CTR), ABC split tests (the highest-confidence signal), human feedback from iterations, and fast iterations (offline scoring). Result: the eval score rose from an 8.7/12 average to a single 11/12 thumbnail in 10 iterations with no human intervention. Performance gap: older thumbnails reach ~14% CTR vs ~3.4% CTR for recent ones, so the system learns from what worked before.

---

## 🎯 Key Points

### 1. Data-Driven Eval Criteria (Not Vibes)

**Process:**
- Scraped 180+ videos from the last 3 years
- Grouped into 3 categories: winners (high CTR), losers (low CTR), mid
- Statistical analysis of titles and thumbnails

**Data-backed patterns:**
- **"How to"** in the title: 50% of winners vs 23% of losers
- **"Tutorial"**: 44% of winners vs 13% of losers
- **Negative framing** (stop, forget, RIP): only 6% of winners
- **Exclamation marks**: loser criterion
- **Questions in the title**: loser criterion

**Conclusion:** The criteria are based on real CTR, not on "it looks good to me"
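
The winner/loser comparison above can be sketched as a simple phrase-frequency count. This is a minimal illustration, not the author's code: the function names and the sample titles are made up, and only the idea of comparing phrase rates between groups comes from the video.

```python
def phrase_rate(titles, phrase):
    """Share of titles that contain `phrase` (case-insensitive)."""
    if not titles:
        return 0.0
    hits = sum(1 for title in titles if phrase.lower() in title.lower())
    return hits / len(titles)


def compare_groups(winners, losers, phrases):
    """Phrase rate in each group, keyed by phrase."""
    return {
        phrase: {
            "winners": phrase_rate(winners, phrase),
            "losers": phrase_rate(losers, phrase),
        }
        for phrase in phrases
    }


# Hypothetical titles, for illustration only
winners = ["How to build an AI agent", "Claude Code tutorial",
           "How to automate email", "My new setup"]
losers = ["Stop using this tool!", "Is AI over?",
          "How to waste time", "RIP prompts"]

report = compare_groups(winners, losers, ["how to", "tutorial"])
for phrase, rates in report.items():
    print(f"{phrase!r}: winners {rates['winners']:.0%}, losers {rates['losers']:.0%}")
```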

---

### 2. 12 Binary Eval Questions

Format: **Yes/No** (not a 1-10 scale), which eliminates ambiguity

**Visual Anchor & Attention:**
1. Single dominant visual anchor (face/graphic) taking 20%+ of frame?
2. Anchor conveys emotion/energy/intrigue?
3. Directional cues present (arrows, pointing)?

**Text & Readability:**
4. Text limited to 1-4 bold, high-contrast words?
5. Text readable at mobile size?

**Composition:**
6. Background simple and uncluttered?
7. Clear visual hierarchy?
8. Shows result/output/transformation (not just tool/process)?

**Branding:**
9. One or more recognizable logos present?

**Packaging (for the title):**
10-12. Similar criteria for the title (how-to, tutorial, avoid negative framing)

**Why binary:** Consistent scoring, automatable, reproducible

---

### 3. Fast Iteration Loop (Offline)

**Flow:**
1. Generate 3 thumbnails
2. Score each against the 12 criteria (Gemini Vision)
3. Identify failures (criteria answered "no")
4. Rewrite the generation prompt to fix the failures
5. Repeat

**Results (10 iterations):**
- Start: 8.7/12 average score
- End: 11/12 single best thumbnail
- **No human feedback**

**Examples of prompt improvements:**
- Iteration 1: "Add emotional intrigue"
- Iteration 3: "Make text much bigger and bolder"
- Iteration 5: "Simplify background, remove clutter"
- Iteration 8: "Increase visual hierarchy with directional cues"

**Benefit:** Better baseline BEFORE publishing
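
The generate → score → rewrite cycle can be sketched with pluggable callables. This is only a sketch under stated assumptions: `fast_iterate`, the stand-in `generate`/`score`/`rewrite` functions and the two toy criteria are hypothetical, and a real setup would call an image generator and a vision model instead.

```python
def fast_iterate(prompt, generate, score, rewrite, cycles=10, per_cycle=3, target=12):
    """Run the generate -> score -> rewrite loop; return (best_score, best_candidate, prompt)."""
    best_score, best_candidate = -1, None
    for _ in range(cycles):
        failures = set()
        for _ in range(per_cycle):
            candidate = generate(prompt)
            passed, failed = score(candidate)
            failures.update(failed)
            if passed > best_score:
                best_score, best_candidate = passed, candidate
        if best_score >= target or not failures:
            break
        # Feed the failed criteria back into the prompt for the next cycle
        prompt = rewrite(prompt, sorted(failures))
    return best_score, best_candidate, prompt


# Toy stand-ins (assumptions): the "thumbnail" is just the prompt text,
# and a criterion passes when its fix instruction appears in that text.
CRITERIA = {
    "big_text": "use big bold text",
    "simple_bg": "keep the background simple",
}

def generate(prompt):
    return prompt

def score(candidate):
    failed = [name for name, fix in CRITERIA.items() if fix not in candidate]
    return len(CRITERIA) - len(failed), failed

def rewrite(prompt, failures):
    return prompt + "".join(f"; {CRITERIA[name]}" for name in failures)

best_score, best, final_prompt = fast_iterate(
    "thumbnail for an AI video", generate, score, rewrite, target=len(CRITERIA)
)
print(best_score, final_prompt)
```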

---

### 4. Daily Slow Loop (Online Feedback)

**Full flow:**
1. **Create thumbnail:** Using thumbnail skill + feedback memory rules
2. **Publish video**
3. **Wait 2-3 days:** YouTube Reporting API data available
4. **Pull CTR data:** Real click-through rate
5. **Score thumbnail:** Against 12 criteria
6. **Correlate:** High eval score + low CTR? = False positive
7. **Update feedback memory JSON:** New data-backed rules
8. **Next thumbnail starts from better baseline**

**Example correlation:**
- Thumbnail scored 11/12 but got 3.4% CTR → False positive
- Identify which criteria failed in practice
- Update rules: "Circular logos = avoid" or "Too much background detail = reduce"
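
The "correlate" step can be sketched as a Pearson correlation between eval scores and CTR, plus a filter for high-eval, low-CTR cases. Everything here is illustrative: the records are invented apart from the 11/12 at 3.4% CTR example, and the thresholds `min_eval` and `max_ctr` are assumptions, not values from the video.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # no variation, correlation undefined
    return cov / (spread_x * spread_y)


def false_positives(records, min_eval=10, max_ctr=5.0):
    """Videos whose thumbnail scored well offline but underperformed on CTR."""
    return [r["video"] for r in records
            if r["eval"] >= min_eval and r["ctr"] <= max_ctr]


# Hypothetical records: eval score out of 12, CTR in percent
records = [
    {"video": "a", "eval": 11, "ctr": 3.4},
    {"video": "b", "eval": 9, "ctr": 8.0},
    {"video": "c", "eval": 12, "ctr": 14.0},
    {"video": "d", "eval": 7, "ctr": 4.0},
]

r = pearson([rec["eval"] for rec in records], [rec["ctr"] for rec in records])
flagged = false_positives(records)
print(f"correlation: {r:.2f}, false positives: {flagged}")
```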

---

### 5. Four Feedback Sources

**1. YouTube Reporting API (slow but accurate)**
- Real CTR post-publish
- 2-3 days latency
- Objective performance data

**2. ABC Split Tests (highest confidence)**
- Same video, same audience, different packaging
- YouTube picks winner automatically
- Controlled experiment = most reliable signal
- Extract winner/loser criteria → feed to memory JSON

**3. Human Feedback (during creation)**
- The author gives feedback on iterations: "I like this, don't like that"
- Subjective but fast
- Helps refine taste preferences

**4. Fast Iterations (offline scoring)**
- Eval before publish
- Catches obvious failures
- Improves baseline

**Prioritization:** ABC splits > YouTube API > Fast iterations > Human feedback
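
One way to apply this ordering when rules from different sources compete is to sort by source priority first and confidence second. This is a sketch under assumptions: `SOURCE_PRIORITY` and `rank_rules` are hypothetical names, the `"human"` label is assumed, and the sample rules mirror the feedback-memory example in this note.

```python
# Highest-priority source first, following the ordering stated above
SOURCE_PRIORITY = ["split_test", "API", "iterations", "human"]


def rank_rules(rules):
    """Sort rules by source priority, then by descending confidence."""
    return sorted(
        rules,
        key=lambda r: (SOURCE_PRIORITY.index(r["source"]), -r["confidence"]),
    )


rules = [
    {"rule": "Text size minimum 48px", "confidence": 0.91, "source": "iterations"},
    {"rule": "Use 'How to' in title", "confidence": 0.85, "source": "API"},
    {"rule": "Avoid circular logos", "confidence": 0.72, "source": "split_test"},
]

ranked = rank_rules(rules)
for r in ranked:
    print(r["source"], "-", r["rule"])
```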

---

### 6. Self-Rewriting Prompts

**Mechanism:**
- Centralized `feedback_memory.json`
- Contains data-backed rules (not vibes)
- Auto-injected into generation prompts

**Example feedback memory:**
```json
{
  "rules": [
    {"rule": "Use 'How to' in title", "confidence": 0.85, "source": "API"},
    {"rule": "Avoid circular logos", "confidence": 0.72, "source": "split_test"},
    {"rule": "Text size minimum 48px", "confidence": 0.91, "source": "iterations"}
  ],
  "winners": [...],
  "losers": [...]
}
```

**Every new thumbnail:**
- Loads feedback memory
- Starts from better baseline
- Incorporates all previous learnings

**Result:** Compounding improvements over time
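
Updating the memory file after each feedback cycle could look like the sketch below. It assumes the rule text acts as the unique key and that the file has the `rules` list shown in the example; `upsert_rule` is a hypothetical helper, and the demo writes to a temporary directory rather than a real `feedback_memory.json`.

```python
import json
import tempfile
from pathlib import Path


def upsert_rule(path, rule, confidence, source):
    """Add a rule to the feedback memory file, or update it if it already exists."""
    path = Path(path)
    if path.exists():
        memory = json.loads(path.read_text(encoding="utf-8"))
    else:
        memory = {"rules": []}
    rules = memory.setdefault("rules", [])
    for existing in rules:
        if existing["rule"] == rule:
            existing.update(confidence=confidence, source=source)
            break
    else:
        # No rule with this text yet, so append a new entry
        rules.append({"rule": rule, "confidence": confidence, "source": source})
    path.write_text(json.dumps(memory, indent=2, ensure_ascii=False), encoding="utf-8")
    return memory


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "feedback_memory.json"
    upsert_rule(demo_path, "Avoid circular logos", 0.72, "split_test")
    # A later split test raises the confidence of the same rule
    memory = upsert_rule(demo_path, "Avoid circular logos", 0.80, "split_test")
    print(memory)
```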

---

## 💬 Relevant Quotes

> "It's never been clearer to me that we need to create these automated loops that improve itself every single time we do them."

> "You can't make up the eval criteria based on vibes. It has to be a yes/no answer."

> "The split test signal is the highest confidence signal because it is a controlled experiment. Same video, same audience but different packaging."

> "Every new thumbnail starts from a better baseline than the last."

> "The numbers are clear. The winners were using 'how to' in the titles 50% of the time, losers 23%."

> "It added specific features like make the text much bigger and bolder. It fixed the text again. It went from giving an average of 8.7 to a single 11 out of 12 in 10 iterations without giving me a single feedback."

> "That video got 29,000 views. But something interesting happened when I was checking the backend stats... the impression click-through rate of this video was 8%. But I have been making videos for 3 years in the AI space and some of my older videos are hitting 14%."

---

## 💡 Insights & Ideas

### ✅ Universal Pattern - Applicable to Echo/Marius

#### 1. Autoresearch Loop = Binary Eval Criteria + Fast Iterations + Feedback Memory

**Core concept:**
- A system that rewrites its own prompts based on real data
- Not specific to thumbnails: it is a universal pattern

**Components:**
1. **Binary eval criteria** (yes/no, not a scale)
2. **Fast iterations** (offline, before deploy)
3. **Slow feedback** (online, post-deploy)
4. **Feedback memory** (centralized rules, auto-injected)

**Applicability to Echo:**

**A. Morning/Evening Reports**
- **Eval criteria:** Includes DONE items? Calendar <48h? Insights with quotes? Length <500 words?
- **Fast iterations:** Generate 3 variants → Score → Improve → Repeat × 5
- **Slow feedback:** Track email open time, reply engagement, ignored sections
- **Memory:** `memory/feedback/report-rules.json`

**B. YouTube Processing**
- **Eval criteria:** TL;DR <150 words? 5+ key points? 3+ quotes? Domain tags?
- **Fast iterations:** Process transcript → 3 summary variants → Score → Improve
- **Slow feedback:** Which insights are [x] executed vs [ ] ignored? Which domains get engagement?
- **Memory:** `memory/feedback/youtube-rules.json`

**C. Coaching Messages (08:00 & 23:00)**
- **Eval criteria:** Open-ended question? Under 100 words? Empathetic tone? Tied to the avatar?
- **Fast iterations:** 3 message variants → Score tone/relevance → Improve
- **Slow feedback:** Reply rate? Depth of Marius's response? Engagement patterns?
- **Memory:** `memory/feedback/coaching-rules.json`

**D. Calendar Alerts**
- **Eval criteria:** Alert <2h before? Includes location? Includes context? Clear action?
- **Fast iterations:** N/A (simple alert)
- **Slow feedback:** Snooze vs confirm rate? Which events get a quick reply?
- **Memory:** `memory/feedback/calendar-rules.json`

---

#### 2. Binary Eval Criteria >> Subjective Scoring

**Why yes/no beats a 1-10 scale:**
- **Eliminates ambiguity:** "Has 3+ quotes?" = clear; "Insight quality 1-10?" = subjective
- **Easy to automate:** Regex, simple checks, no ML needed
- **Reproducible:** Same input → same score (not mood-dependent)
- **Actionable:** "No" = you know exactly what to fix; "Score 6/10" = what does that mean?

**For Echo:**
- ✅ "Includes link preview?" vs ❌ "How useful is the link, 1-10?"
- ✅ "Marius replied in <24h?" vs ❌ "How urgent did it seem, 1-10?"
- ✅ "Git uncommitted files?" vs ❌ "How important is the commit, 1-10?"

**Simple implementation:**
```python
def eval_binary_criteria(content, criteria_list):
    """Run every yes/no check against `content` and tally the results."""
    score = 0
    failures = []
    for criterion in criteria_list:
        if criterion['check'](content):
            score += 1
        else:
            failures.append(criterion['name'])
    return {'score': score, 'total': len(criteria_list), 'failures': failures}
```
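
A short usage sketch may help show what the result dict looks like. The snippet restates `eval_binary_criteria` in condensed form so it runs on its own; the three criteria and the sample text are hypothetical and chosen only to produce one failure.

```python
import re


def eval_binary_criteria(content, criteria_list):
    """Condensed restatement: count passing checks and list the failing ones."""
    failures = [c['name'] for c in criteria_list if not c['check'](content)]
    return {
        'score': len(criteria_list) - len(failures),
        'total': len(criteria_list),
        'failures': failures,
    }


# Hypothetical criteria for a Markdown summary
criteria = [
    {'name': 'three_plus_quotes',
     'check': lambda md: len(re.findall(r'^> ', md, re.M)) >= 3},
    {'name': 'tldr_present', 'check': lambda md: 'TL;DR' in md},
    {'name': 'under_500_words', 'check': lambda md: len(md.split()) < 500},
]

sample = "## TL;DR\nShort summary.\n\n> first quote\n> second quote\n"
result = eval_binary_criteria(sample, criteria)
print(result)
```

The `failures` list is what a prompt-rewriting step would consume: each name points to one concrete thing to fix.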

---

#### 3. Fast Iterations (Offline) vs Slow Feedback (Online)

**Fast iterations (before deploy):**
- **Goal:** Improve the baseline without waiting for real-world data
- **Speed:** Seconds to minutes
- **Feedback:** Eval criteria (binary checks)
- **Benefit:** Start from a better baseline

**Slow feedback (post-deploy):**
- **Goal:** Validate assumptions, correlate eval score with real outcomes
- **Speed:** Hours to days
- **Feedback:** Real user behavior (CTR, reply rate, engagement)
- **Benefit:** Detect false positives, refine rules

**For the Ralph Workflow:**
- **Fast:** PRD generation → Self-review stories → Opus rewrite stories → Iterate (before Claude Code implementation)
- **Slow:** Deploy → Track bugs, missed dependencies, story rewrites → Feed back to PRD templates

**Combined benefit:**
- Fast = fewer bad deploys
- Slow = continuous refinement based on reality

---

#### 4. Multiple Feedback Sources = Higher Confidence

**YouTube case (4 sources):**
1. YouTube API (real CTR) - objective, slow
2. ABC split tests - highest confidence (controlled experiment)
3. Human feedback - subjective, fast
4. Fast iterations - eval-based, instant

**Prioritization:** Controlled experiments > Objective metrics > Eval criteria > Human vibes

**For Echo:**

**Morning Reports:**
1. **Email open tracking** (objective, medium speed) - "Open rate <1h?"
2. **Reply engagement** (objective, fast) - "Reply to which sections?"
3. **A/B test formats** (highest confidence) - "Weekly variation, track response"
4. **Self-eval** (instant) - "Binary criteria passed?"

**YouTube Processing:**
1. **Insights execution rate** (objective, slow) - "[x] vs [ ] ratio"
2. **Follow-up tasks** (objective, medium) - "Video generates task?"
3. **Domain relevance** (subjective, fast) - "Marius interest level?"
4. **Self-eval** (instant) - "TL;DR length, quotes count, tags present?"

**Implementation:**
```python
# Weights as given in this note. They rank objective metrics above controlled
# tests, while the prioritization above puts controlled experiments first.
feedback_sources = [
    {'name': 'objective_metric', 'weight': 0.4},  # CTR, reply rate, etc.
    {'name': 'controlled_test', 'weight': 0.3},   # A/B splits
    {'name': 'eval_criteria', 'weight': 0.2},     # Binary checks
    {'name': 'human_feedback', 'weight': 0.1},    # Subjective
]


def aggregate_feedback(sources_data):
    """Weighted sum of per-source scores.

    `sources_data` must hold one {'score': ...} dict per entry of
    `feedback_sources`, in the same order.
    """
    weighted_score = sum(data['score'] * src['weight']
                         for src, data in zip(feedback_sources, sources_data))
    return weighted_score
```

---

#### 5. Self-Rewriting Prompts via Feedback JSON

**Pattern:**
- Centralized feedback memory (`feedback_memory.json`)
- Contains data-backed rules (confidence score, source)
- Auto-injected into generation prompts
- Every iteration starts from a better baseline

**Example structure:**
```json
{
  "domain": "morning_reports",
  "last_updated": "2026-03-21",
  "rules": [
    {
      "rule": "Include DONE items in the first 3 paragraphs",
      "confidence": 0.89,
      "source": "email_tracking",
      "rationale": "Open rate +42% when DONE is at the top"
    },
    {
      "rule": "Calendar alerts <48h must be bold",
      "confidence": 0.76,
      "source": "reply_engagement",
      "rationale": "Confirm rate +28% when bold"
    },
    {
      "rule": "Skip the git status section when there are no uncommitted files",
      "confidence": 0.94,
      "source": "controlled_test",
      "rationale": "Reply time -15min when empty sections are skipped"
    }
  ],
  "anti_patterns": [
    {
      "pattern": "Bullet lists with >10 items",
      "confidence": 0.81,
      "rationale": "Ignored rate +35%"
    }
  ]
}
```

**Auto-injection into the prompt:**
```python
import json


def enhance_prompt_with_feedback(base_prompt, feedback_json_path):
    """Append high-confidence rules and anti-patterns to `base_prompt`."""
    with open(feedback_json_path, encoding='utf-8') as f:
        feedback = json.load(f)

    # Keep only high-confidence rules (>0.7)
    rules = [r for r in feedback['rules'] if r['confidence'] > 0.7]

    # Render rules and anti-patterns as bullet lists
    rules_text = "\n".join(f"- {r['rule']} (confidence: {r['confidence']:.0%})"
                           for r in rules)
    anti_patterns_text = "\n".join(f"- {ap['pattern']}"
                                   for ap in feedback['anti_patterns'])

    enhanced = f"""{base_prompt}

DATA-BACKED RULES (apply these strictly):
{rules_text}

ANTI-PATTERNS (avoid these):
{anti_patterns_text}
"""
    return enhanced
```

**Benefit:** Compounding improvements - every report/insight/email is better than the last

---

#### 6. Data >> Vibes

**YouTube case:**
- Gap: 14% CTR (old thumbnails) vs 3.4% CTR (new) = **~10.6 percentage points**
- Objective, measurable, impossible to ignore

**For Marius:**

**A. New clients (entrepreneurship)**
- **Vibe:** "I don't know if this will work"
- **Data:** Track pitch proposals → response rate → conversion rate
- **Insight:** "Email pitch with a case study = 43% reply vs 12% without"

**B. ROA support tickets**
- **Vibe:** "This client is difficult"
- **Data:** Track ticket resolution time, follow-up questions, satisfaction
- **Insight:** "Video tutorial = 2.1 follow-ups vs 4.7 with a text explanation"

**C. ROA features**
- **Vibe:** "Feature X is important"
- **Data:** Track feature usage post-deploy (analytics)
- **Insight:** "New reports = 78% monthly active users, PDF export = 12%"

**D. Echo reports**
- **Vibe:** "This report is useful"
- **Data:** Track open rate, reply time, sections clicked
- **Insight:** "Morning report open <1h = 64%, evening report = 31%"

**Tracking implementation:**
```python
# In tools/analytics_tracker.py
import json
import sqlite3
import time


class FeedbackTracker:
    def __init__(self, db_path='memory/feedback/analytics.db'):
        self.db = sqlite3.connect(db_path)

    def track_event(self, domain, event_type, metadata):
        """Track any feedback event"""
        # Column names follow the `events` schema defined later in this note
        self.db.execute("""
            INSERT INTO events (domain, event_type, metadata, timestamp)
            VALUES (?, ?, ?, ?)
        """, (domain, event_type, json.dumps(metadata), int(time.time())))
        self.db.commit()  # persist the row

    def get_insights(self, domain, window_days=30):
        """Extract data-backed insights"""
        # Query events in the window
        # Calculate rates, patterns, correlations
        # Return ranked insights with confidence scores
        raise NotImplementedError
```

---

### 🛠️ Practical Implementation for Echo

#### Plan A: Self-Improving Morning Reports

**Phase 1: Set Up Eval Criteria (1 day)**
```python
# In tools/morning_report_autoresearch.py
import re

# 'weight' is carried as metadata only; eval_binary_criteria counts each pass as 1.
EVAL_CRITERIA = [
    {
        'name': 'done_items_present',
        'check': lambda report: bool(re.search(r'✅.*DONE', report)),
        'weight': 0.15
    },
    {
        'name': 'calendar_alerts_48h',
        'check': lambda report: bool(re.search(r'📅.*<48h', report)),
        'weight': 0.20
    },
    {
        'name': 'length_under_500',
        'check': lambda report: len(report.split()) < 500,
        'weight': 0.10
    },
    {
        'name': 'insights_with_quotes',
        'check': lambda report: report.count('"') >= 2,
        'weight': 0.15
    },
    {
        'name': 'git_status_if_needed',
        'check': lambda report: ('uncommitted' in report.lower()) or ('git status: clean' in report.lower()),
        'weight': 0.10
    },
    {
        'name': 'link_preview_offered',
        'check': lambda report: 'moltbot.tailf7372d.ts.net/echo/' in report,
        'weight': 0.10
    }
]
```

**Phase 2: Fast Iterations (integrate into daily-morning-checks)**
```python
def generate_report_with_autoresearch():
    # Inject feedback memory into the base prompt
    # (enhance_prompt_with_feedback takes the JSON path and loads the file itself)
    prompt = enhance_prompt_with_feedback(
        BASE_REPORT_PROMPT, 'memory/feedback/morning-report-rules.json')

    # Fast iteration loop (up to 5 cycles)
    best_report = None
    best_score = -1  # so a first report that scores 0 is still kept

    for i in range(5):
        report = generate_report(prompt)
        eval_result = eval_binary_criteria(report, EVAL_CRITERIA)

        if eval_result['score'] > best_score:
            best_report = report
            best_score = eval_result['score']

        if eval_result['score'] >= 5:  # 5 of 6 criteria pass (83%+)
            break

        # Rewrite prompt based on failures
        prompt = fix_prompt(prompt, eval_result['failures'])

    return best_report
```

**Phase 3: Slow Feedback Tracking (background job)**
```python
# New cron job: feedback-tracker (daily 04:00)
# Helper functions and the report id/time variables are placeholders, not defined in this note.
def track_morning_report_feedback():
    """Runs daily after the morning report (03:00)"""
    insights = []

    # 1. Check email open time (Gmail API)
    open_time = get_email_open_time(latest_morning_report_id)

    # 2. Track reply engagement (Discord API)
    reply = get_discord_reply(channel='#echo', after=morning_report_time)

    # 3. Analyze patterns
    if open_time is not None and open_time < 3600:  # opened within 1h
        score_positive('fast_open')
        insights.append('fast_open')

    if reply and 'secțiune X' in reply:  # placeholder section name
        score_positive('section_X_engagement')
        insights.append('section_X_engagement')

    # 4. Update feedback JSON
    update_feedback_memory('morning-report-rules.json', insights)
```

**Estimated effort:**
- Setup: 4-6h (eval criteria, fast iteration loop, feedback tracking)
- Maintenance: 0h (automatic after setup)
- Benefit: More relevant reports, fewer follow-up questions

---

#### Plan B: YouTube Processing Quality Loop

**Phase 1: Eval Criteria**
```python
import re

YOUTUBE_EVAL_CRITERIA = [
    {'name': 'tldr_under_150', 'check': lambda md: len(extract_tldr(md).split()) < 150},
    {'name': 'five_plus_points', 'check': lambda md: md.count('###') >= 5},
    {'name': 'three_plus_quotes', 'check': lambda md: md.count('> ') >= 3},
    {'name': 'insights_marked', 'check': lambda md: bool(re.search(r'[✅🔴]', md))},
    {'name': 'tags_present', 'check': lambda md: bool(re.search(r'@(work|health|growth)', md))},
    {'name': 'link_preview', 'check': lambda md: 'files.html#memory/kb/' in md}
]
```

**Phase 2: Fast Iterations in youtube_subs.py**
```python
def process_with_autoresearch(transcript, title):
    # Same injection helper as for morning reports; it loads the JSON file itself
    prompt = enhance_prompt_with_feedback(
        BASE_YOUTUBE_PROMPT, 'memory/feedback/youtube-rules.json')

    for i in range(3):
        summary_md = generate_summary(prompt, transcript, title)
        eval_result = eval_binary_criteria(summary_md, YOUTUBE_EVAL_CRITERIA)

        if eval_result['score'] >= 5:
            break

        prompt = fix_prompt(prompt, eval_result['failures'])

    return summary_md
```

**Phase 3: Slow Feedback (manual + automated)**
```python
# Track in memory/approved-tasks.md or memory/YYYY-MM-DD.md
# When Marius marks an insight as [x] executed:
def track_insight_execution(insight_text, video_id):
    feedback_db.record_positive('insight_execution', {
        'video_id': video_id,
        'insight': insight_text,
        'domain': extract_domain(insight_text)  # @work, @health, etc.
    })

# Monthly review (or on request):
def analyze_youtube_patterns():
    # Which domains have the highest [x] rate?
    # Which types of insights are ignored?
    # Which TL;DR length gets the best engagement?
    # Update youtube-rules.json
    pass  # not implemented yet
```

**Estimated effort:**
- Setup: 3-4h
- Maintenance: 1h/month (manual pattern review)
- Benefit: More actionable insights, less noise

---

#### Plan C: Ralph PRD Quality Loop

**Phase 1: PRD Eval Criteria**
```python
import re

RALPH_PRD_CRITERIA = [
    {'name': 'use_cases_defined', 'check': lambda prd: '## Use Cases' in prd and prd.count('- ') >= 3},
    {'name': 'success_metrics', 'check': lambda prd: bool(re.search(r'(KPI|metric|measure)', prd, re.I))},
    {'name': 'tech_stack_specified', 'check': lambda prd: '## Tech Stack' in prd},
    {'name': 'stories_have_acceptance', 'check': lambda prd: prd.count('Acceptance Criteria:') >= 3},
    {'name': 'dependencies_identified', 'check': lambda prd: '## Dependencies' in prd},
    {'name': 'testing_strategy', 'check': lambda prd: bool(re.search(r'test', prd, re.I))}
]
```

**Phase 2: Fast Iterations (Opus + Sonnet collaboration)**
```python
# In tools/ralph_prd_generator.py
def create_prd_with_autoresearch(project_name, description):
    feedback = load_feedback('memory/feedback/ralph-prd-rules.json')

    for i in range(3):
        # Opus: Generate PRD
        prd_md = opus_generate_prd(project_name, description, feedback)

        # Sonnet: Evaluate vs criteria
        eval_result = sonnet_eval_prd(prd_md, RALPH_PRD_CRITERIA)

        if eval_result['score'] >= 5:
            break

        # Opus: Rewrite based on failures
        description = opus_enhance_brief(description, eval_result['failures'])

    # Generate prd.json
    prd_json = opus_prd_to_json(prd_md)

    return prd_md, prd_json
```

**Phase 3: Slow Feedback (post-implementation tracking)**
```python
# New file: memory/feedback/ralph-tracking.json
# Shown here as a Python dict; in the JSON file itself, prd_score must be a
# string (or two numbers) and comments are not allowed.
ralph_tracking = {
    "projects": [
        {
            "name": "roa-report-new",
            "prd_score": "6/6",
            "implementation": {
                "stories_completed_no_changes": 8,
                "stories_rewritten": 2,
                "bugs_post_deploy": 1,
                "missed_dependencies": 0
            },
            "quality_score": 0.87  # Derived metric
        }
    ]
}

# Monthly/per-project review:
def analyze_ralph_quality():
    # PRD score 6/6 → high quality_score? Correlation?
    # Which criteria correlate most with success?
    # Update ralph-prd-rules.json
    pass  # not implemented yet
```

**Estimated effort:**
- Setup: 5-7h (the Opus+Sonnet collaboration is complex)
- Maintenance: 1h/project (manual post-deploy review)
- Benefit: More robust PRDs, fewer rewrites during implementation

---

### 🔴 Limitations and Caveats

#### 1. Overfitting to Historical Data

**Problem:**
- Optimizing for "what worked in the past" can miss "what works NOW"
- Context changes: audience, trends, and Marius's preferences evolve

**YouTube case:**
- Thumbnails from 3 years ago: 14% CTR
- Optimizing for those patterns may be outdated

**Solution for Echo:**
- **Periodic baseline reset:** once a month, ignore the oldest 20% of data
- **A/B test new approaches:** Don't only optimize current rules, try variations
- **Track rule age:** Decay the confidence score over time (a rule from 2025 = lower confidence in 2026)

**Implementation:**
```python
def decay_rule_confidence(rule, current_date):
    # rule['created'] and current_date are expected to be datetime.date objects.
    # timedelta has no .months attribute, so count calendar months directly.
    created = rule['created']
    age_months = (current_date.year - created.year) * 12 + (current_date.month - created.month)
    decay_factor = 0.95 ** age_months  # 5% decay per month
    return rule['confidence'] * decay_factor
```

---

#### 2. False Positives in Eval Criteria

**Problem:**
- High eval score ≠ high real-world performance
- Eval criteria can be superficial (they check form, not substance)

**YouTube case:**
- Thumbnail scored 11/12 but got 3.4% CTR
- The binary criteria passed, but the real audience did not click

**Solution for Echo:**
- **MUST correlate eval score with real outcomes**
- Track: eval_score vs reply_rate, open_time, engagement
- Identify false positives: high eval, low outcome
- Refine criteria: "What did eval miss?"

**Implementation:**
```python
def detect_false_positives(reports_db, threshold_eval=0.8, threshold_outcome=0.5):
    """Find reports with a high eval score but low real engagement"""
    false_positives = []
    for report in reports_db:
        if report['eval_score'] > threshold_eval and report['outcome_score'] < threshold_outcome:
            false_positives.append(report)
    # Analyze: which criteria passed but should not have?
    return false_positives
```

---

#### 3. Slow Feedback Loop Latency

**Problem:**
- YouTube API = 2-3 day delay for CTR data
- Slow to adapt to real-time changes

**For Echo:**
- **Email feedback:** Gmail API = same day (faster)
- **Discord replies:** Instant (if Marius replies)
- **BUT:** Reply patterns vary (mood, busyness, etc.)

**Solution:**
- **Combine fast + slow signals:**
  - Fast: Email open time (hours)
  - Slow: Reply engagement patterns (days)
  - Very slow: Monthly satisfaction review
- **Weight fast signals lower** (more noise), slow signals higher (more signal)

---

#### 4. Human-in-the-Loop Bias

**Problem:**
- If Marius gives feedback based on vibes (not data), the loop degrades
- "I liked this report" ≠ "This report helped me make a decision"

**Solution:**
- **Prioritize objective metrics** > human feedback
- **Ask specific questions:** "Which section was most useful?" (not "Did you like it?")
- **Track behavior, not opinions:** Open time, reply time, action taken (more reliable than a "1-10 rating")

**Implementation:**
```python
feedback_weights = {
    'objective_metric': 0.5,   # CTR, reply time, open rate
    'controlled_test': 0.3,    # A/B splits
    'eval_criteria': 0.15,     # Binary checks
    'human_feedback': 0.05     # Lowest weight (most biased)
}
```

---

### 📊 Success Metrics for Echo

If we implement the autoresearch loop for reports/insights/emails:

#### Baseline (Current - Unknown)

**Morning Reports:**
- Generation time: ~5min (estimate)
- Marius reply rate: ?% (not tracked)
- Open time: ?h (not tracked)
- Sections clicked: ? (not tracked)

**YouTube Processing:**
- Generation time: ~3min (estimate)
- Insights execution rate: ?% [x] vs [ ] (not systematically tracked)
- Follow-up tasks: ? (not tracked)

**Email Communication:**
- Draft time: ~2min (estimate)
- Reply time: ?h average (not tracked)
- Action items completed: ?% (not tracked)

---

#### Target (With Autoresearch - 3 Months)

**Morning Reports:**
- Generation time: <3min (fast iterations reduce back-and-forth)
- Marius reply rate: >70% (more relevant content)
- Open time: <1h for 80% of reports (better subject lines)
- Sections clicked: Track + optimize (feedback JSON)

**YouTube Processing:**
- Generation time: <2min (optimized prompts)
- Insights execution rate: >50% [x] (more actionable)
- Follow-up tasks: 30%+ of relevant videos (better filtering)

**Email Communication:**
- Draft time: <1min (learned patterns)
- Reply time: <12h average (clearer action items)
- Action items completed: >80% (better framing)

---

#### Tracking Implementation

**New: `memory/feedback/analytics.db` (SQLite)**
```sql
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    domain TEXT,        -- 'morning_report', 'youtube', 'email'
    event_type TEXT,    -- 'open', 'reply', 'execute_insight', 'click'
    metadata JSON,      -- {report_id, section, timestamp, etc.}
    timestamp INTEGER
);

CREATE TABLE feedback_rules (
    id INTEGER PRIMARY KEY,
    domain TEXT,
    rule TEXT,
    confidence REAL,
    source TEXT,        -- 'api', 'split_test', 'human', 'eval'
    rationale TEXT,
    created INTEGER,
    last_updated INTEGER
);
```
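
To show how the `events` table could feed an outcome metric, here is a minimal sketch using Python's standard `sqlite3` module with an in-memory database. The helper names `track_event` and `event_counts`, the sample events, and the definition of reply rate as replies divided by opens are all assumptions made for illustration.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # in-memory DB for the demo, not analytics.db
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        domain TEXT,
        event_type TEXT,
        metadata JSON,
        timestamp INTEGER
    )
""")


def track_event(domain, event_type, metadata):
    """Insert one feedback event with the current Unix timestamp."""
    conn.execute(
        "INSERT INTO events (domain, event_type, metadata, timestamp) VALUES (?, ?, ?, ?)",
        (domain, event_type, json.dumps(metadata), int(time.time())),
    )


def event_counts(domain, window_days=30):
    """Count events per type for one domain within the time window."""
    since = int(time.time()) - window_days * 86400
    rows = conn.execute(
        "SELECT event_type, COUNT(*) FROM events "
        "WHERE domain = ? AND timestamp >= ? GROUP BY event_type",
        (domain, since),
    ).fetchall()
    return dict(rows)


# Hypothetical events
track_event("morning_report", "open", {"report_id": 1})
track_event("morning_report", "open", {"report_id": 2})
track_event("morning_report", "reply", {"report_id": 1, "section": "calendar"})

counts = event_counts("morning_report")
reply_rate = counts.get("reply", 0) / counts["open"]
print(counts, reply_rate)
```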

**Dashboard tracking:**
```python
# Extend dashboard/index.html with an Analytics tab
# Show:
# - Eval score trends over time (improving?)
# - Outcome metrics (reply rate, open time, execution rate)
# - Correlation: eval vs outcome (detect false positives)
# - Top rules by confidence
# - Recent feedback events
```

---

## 🔗 Links & Resources

- **Video:** https://youtu.be/0PO6m09_80Q
- **Karpathy Autoresearch:** https://github.com/karpathy/autoresearch (referenced)
- **YouTube Reporting API:** https://developers.google.com/youtube/reporting
- **YouTube Analytics API:** https://developers.google.com/youtube/analytics
- **Gemini Vision:** Used for thumbnail scoring

**Cohort mentioned:**
- Live build session: March 23rd (Monday & Thursday)
- Free community: ~1,000 members, "AI agent classroom"
- Python file: 1,000 lines (shared in the community)

---

## 📝 Additional Notes

### Original Performance Gap
- **Old thumbnails (3 years):** 14-18% CTR (best performers)
- **Recent thumbnails:** 3.4-9% CTR
- **Gap:** 10+ percentage points → motivation for autoresearch

### ABC Split Test Winner
- **A (abstract/text-heavy):** 51% preference
- **B (mid):** 28%
- **C (author face):** 21% (lowest - "That hurts")

### Implementation Details
- **Airtable:** Used for storing video data (500+ videos)
- **Gemini Vision:** Scoring thumbnails vs criteria
- **1,000 lines Python:** Entire autoresearch system
- **Fast iterations:** 10 cycles, 3 thumbnails each = 30 total generated
- **Final winner:** 11/12 score (only 1 criterion failed)

### Author's Other Systems
- **AI clone for social media:** Instagram/Facebook reels (35k views, automated)
- **Thumbnail skill:** Existing skill in OpenClaw/Claude Code for quick generation

---

**Status:** [ ] Discuss with Marius: do we implement autoresearch for Echo reports?
**Priority:** High - universal pattern, large long-term benefit
**Estimated effort:** 10-15h initial setup (all 3 domains), then automatic
**ROI:** Compounding improvements - every report/insight better than the last