# Build Powerful Local Coding Agent on Budget GPU with Llama.cpp and Pi

**URL:** https://youtu.be/0AqpaFm11oI?si=LGIuBQD1ptTv7vGn
**Date:** 2026-05-30
**Duration:** 16:56
**Tags:** @work @growth #local-ai #llama-cpp #coding-agent #moe #hardware

---

## TL;DR

How to run a local coding agent at a "mid-frontier" level (comparable to Claude Code) on a budget GPU (RTX 3060, 12GB VRAM), with no rate limits and no cloud subscription. The ingredients: Q4-quantized REAP MoE models, aggressive llama.cpp tuning (threads + ubatch + KV compression), the Pi agent, and Tailscale for remote access.

---

## Key points

- **MoE > dense at equivalent cost:** a 30B MoE model runs at roughly the speed of a 3B dense model, because only a fraction of its parameters is active for each token. The video's claim is that all frontier models (GPT, Claude) are MoE at trillion-parameter scale. The sweet spot for real work: 20-40B parameters.

- **REAP pruning:** per a Cerebras paper, roughly 20% of a model's MoE experts (the rarely used ones) can be removed. The pruned models are smaller and sometimes score *better* on benchmarks (HumanEval: 95.1 vs 94.5 unpruned). Unsloth offers REAP variants for Qwen 3.6B MoE and GLM 4.7 Flash 23B.

- **Performance hierarchy:** VRAM > RAM speed / PCIe bandwidth > CPU cores. DDR4 is the bottleneck at ~54 GB/s, which caps decode at ~50 tokens/s when the model sits in RAM.

- **Ubatch is the key to fast prompt processing** (critical for agents):
  - Ubatch 256 → 300 tokens/s prefill (Qwen)
  - Ubatch 2048 → 1,142 tokens/s prefill, almost 4x faster
  - TG (decode) is unchanged; ubatch affects ONLY prefill
  - Trade-off: a larger ubatch uses more VRAM

- **Optimal threads = CPU cores - 1**, not the maximum. On a 4-core CPU: 3 threads = 39.5 tok/s, while 4 threads collapses to 22 tok/s. One core has to be left free for scheduling and GPU management.

- **KV compression (TurboQuant):**
  - Keys (K) → Turbo4 (near lossless)
  - Values (V) → Turbo2 (preserves the shape of the vector, not its exact precision)
  - GLM: +12% decode, -25% prefill, a classic trade-off
  - Qwen: +4% prefill, +5% decode, a pure win
  - The larger the model relative to VRAM, the more compression helps (freed VRAM → extra layers on the GPU)

- **llama.cpp cache reuse:** the prompt cache is split into 256-token chunks. When the prompt changes only partially, only the modified chunks are reprocessed → faster TTFT for agents.

- **Model presets (models.ini):** llama-server can manage several configured models. Switching models from Pi (`/models`) makes the server unload the old model and load the new one automatically, so no manual restart is needed.

- **Tailscale for remote access:** install it on the AI rig and on the laptop, then reach llama-server through its Tailscale IP from anywhere. The experience is identical to using a cloud agent.

- **Recommended agent: Pi.** Lightweight, customizable, native llama.cpp support with no middleware. `pip install mcp-pi-llama-cpp` + URL in settings.json.
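
The DDR4 figure in the performance-hierarchy point is a bandwidth argument: every decoded token has to stream the active weights from memory once, so decode speed is bounded by memory bandwidth divided by the bytes read per token. A minimal sketch of that arithmetic follows. The per-token sizes are assumed values for illustration (1.08 GB is chosen because it reproduces the ~50 tokens/s quoted above); they are not numbers from the video.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when weights stream from memory once per token."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


DDR4_GB_S = 54.0  # bandwidth figure from the notes

# Assumed amounts of weight data read per token (GB); illustrative only.
for gb_per_token in (1.08, 1.5, 2.0):
    ceiling = decode_ceiling_tok_s(DDR4_GB_S, gb_per_token)
    print(f"{gb_per_token:.2f} GB/token -> at most {ceiling:.0f} tok/s")
```

The bound ignores compute time and cache effects, so real decode speed lands below it. It also shows why MoE helps on this hardware: fewer active parameters per token means fewer bytes to stream.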

---

## Notable quotes

> "It doesn't matter how well or how much we optimize, it will never beat a model that is totally loaded into the VRAM of a GPU."

> "All the frontier models are trillion parameter models with an MoE architecture. Why do you think frontier labs are doing that? They don't have the hardware to run a dense 1 trillion parameter model."

> "Agents are mostly pre-fill. Processing the long system prompt with instructions, MCP content, tool usage details, documents, and code files."

> "A lot of the time we see people optimize for the token speed... but to run agents we actually need some prompt processing speed. It is much more important than the token speed that we are chasing."

> "No subscription, no API key, and no rate limit. It's already yours and you can run it as much as you want as long as you can pay for the electricity bill."

---

## Recommended setup (RTX 3060 12GB)

| Component | Choice |
|-----------|--------|
| GPU | RTX 3060 12GB (or any GPU with ≥ 8GB VRAM) |
| Model 1 (code) | Qwen 3.6B MoE REAP Q4_K_M (Unsloth) |
| Model 2 (general) | GLM 4.7 Flash REAP 23B Q4_K_M |
| Quantization | Q4_K_M or Unsloth dynamic Q4 |
| Threads | CPU cores - 1 (e.g. 3 of 4 cores) |
| Ubatch | 1024 (870 tok/s prefill with VRAM headroom) |
| KV compression | K=Turbo4, V=Turbo2 |
| Agent | Pi (coding agent) |
| Remote access | Tailscale |
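
As a rough illustration of how the table maps onto a server launch, the sketch below assembles a `llama-server` argument list. It only builds and prints the command. The model path is hypothetical, and the `turbo4`/`turbo2` strings are assumed spellings of the notes' Turbo4/Turbo2 settings, which may only be accepted by a build that includes TurboQuant support.

```python
import shlex
from typing import List, Optional


def recommended_threads(cpu_cores: int) -> int:
    """Leave one core free for scheduling and GPU management (never below 1)."""
    return max(1, cpu_cores - 1)


def build_llama_server_cmd(
    model_path: str,
    cpu_cores: int,
    ubatch: int = 1024,
    cache_type_k: Optional[str] = None,
    cache_type_v: Optional[str] = None,
    host: str = "127.0.0.1",
    port: int = 8080,
) -> List[str]:
    """Assemble an argv list for llama-server from the settings in the table."""
    cmd = [
        "llama-server",
        "--model", model_path,
        "--threads", str(recommended_threads(cpu_cores)),
        "--ubatch-size", str(ubatch),
        "--host", host,
        "--port", str(port),
    ]
    # KV cache types are optional; omit the flags to keep the build's defaults.
    if cache_type_k:
        cmd += ["--cache-type-k", cache_type_k]
    if cache_type_v:
        cmd += ["--cache-type-v", cache_type_v]
    return cmd


cmd = build_llama_server_cmd(
    "models/qwen-reap-q4_k_m.gguf",  # hypothetical path
    cpu_cores=4,                      # physical cores, as in the notes' 4-core example
    cache_type_k="turbo4",            # assumed value name, see lead-in
    cache_type_v="turbo2",            # assumed value name, see lead-in
)
print(shlex.join(cmd))
```

The core count is passed in explicitly because `os.cpu_count()` reports logical CPUs, which can be double the physical core count on machines with SMT. The default host keeps the server on localhost; reaching it over Tailscale means binding to an address the tailnet can see.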

---

## Relevance for Echo / ROA

- **Potential:** A local setup with an RTX 3060 + Pi could run an autonomous coding agent (similar to Ralph) with no API cost and no rate limit. Useful if Anthropic tightens its limits.
- **Pragmatic (80/20):** Echo + Ralph currently run on an Anthropic Pro subscription, and the cost is fine. A local setup means significant hardware effort. Worth monitoring as an alternative, not acting on immediately.
- **Key insight for any local LLM:** prefill speed > decode speed for agentic use cases (routers, heartbeats, cron jobs with large context).
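
The prefill-over-decode point can be made concrete with a back-of-the-envelope model of one agent turn: time to process the prompt plus time to generate the reply. The sketch below reuses throughput figures from the notes (300 and 1,142 tok/s prefill, 39.5 tok/s decode). The workload of a 20,000-token prompt and a 300-token reply is an assumption, and combining figures from separate measurements is a simplification.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tok_s: float, decode_tok_s: float) -> float:
    """Time for one agent turn: process the prompt, then generate the reply."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


# Assumed workload (not from the video): long agent prompt, short reply.
PROMPT_TOKENS, OUTPUT_TOKENS = 20_000, 300
DECODE_TOK_S = 39.5  # 3-thread decode figure from the notes

for label, prefill_tok_s in [("ubatch 256", 300.0), ("ubatch 2048", 1142.0)]:
    total = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tok_s, DECODE_TOK_S)
    prefill_share = (PROMPT_TOKENS / prefill_tok_s) / total
    print(f"{label}: {total:.1f} s per turn, {prefill_share:.0%} of it in prefill")
```

Under these assumptions a turn drops from roughly 74 s to roughly 25 s with decode speed unchanged. The model ignores cache reuse, which would shrink the prefill term further on turns where most of the prompt is unchanged.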