Infrastructure (Proxmox + Docker)

Last updated: 2026-04-26. Synced with romfastsql/proxmox/ in Gitea. Local clone of the repo: /home/moltbot/workspace/romfastsql/ (HTTPS, no SSH key). Detailed per-LXC/VM documentation: romfastsql/proxmox/<component>/README.md
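
To refresh the local clone before relying on it, a minimal sketch (assumes the HTTPS remote is already set up, per the note above):

# Pull the latest docs from Gitea into the local clone
git -C /home/moltbot/workspace/romfastsql pull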

LXC quick access

ID | Name | Node | IP | Direct SSH | Via Proxmox
100 | portainer | pvemini | 10.0.20.170 | ssh echo@10.0.20.170 | ssh echo@10.0.20.201 "sudo pct exec 100 -- bash"
101 | minecraft | pveelite | 10.0.20.162 | ssh echo@10.0.20.162 | ssh echo@10.0.20.202 "sudo pct exec 101 -- bash"
103 | dokploy | pvemini | 10.0.20.167 | ssh echo@10.0.20.167 | ssh echo@10.0.20.201 "sudo pct exec 103 -- bash"
104 | flowise | pvemini | 10.0.20.161 | (publickey only) | ssh echo@10.0.20.201 "sudo pct exec 104 -- bash"
106 | gitea | pvemini | 10.0.20.165 | | ssh echo@10.0.20.201 "sudo pct exec 106 -- sh" ⚠️ Alpine (sh, not bash)
108 | central-oracle | pvemini | 10.0.20.121 | ssh echo@10.0.20.121 | ssh echo@10.0.20.201 "sudo pct exec 108 -- bash"
110 | moltbot | pveelite | 10.0.20.173 | ssh moltbot@10.0.20.173 | ssh echo@10.0.20.202 "sudo pct exec 110 -- bash"
171 | claude-agent | pvemini ⚠️ | 10.0.20.171 | ssh claude@10.0.20.171 | ssh echo@10.0.20.201 "sudo pct exec 171 -- bash"
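
If in doubt about placement (it has drifted before, e.g. LXC 171), pct list on each Proxmox node shows the containers it currently hosts:

# LXCs on pvemini
ssh echo@10.0.20.201 "sudo pct list"
# LXCs on pveelite
ssh echo@10.0.20.202 "sudo pct list"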

LXC 100 — portainer (pvemini)

  • IP: 10.0.20.170 | OS: Debian/systemd | Tailscale: yes
  • Resources: 4GB RAM (414MB used) | 20GB disk (4GB used, 20%)
  • Portainer UI: https://10.0.20.170:9443
  • docker-compose projects: /opt/docker/

Docker containers:

Container | External port | Status | Description
portainer | 9443 | healthy | Docker management
hbbs | 21115-21116, 21118 | | RustDesk relay (STUN)
hbbr | 21117, 21119 | | RustDesk relay (TURN)
pulse | 7655 | healthy | Proxmox monitoring
wol-manager | | | Wake-on-LAN
bt-web-automation | 5000, 8081→8080 | | BT automation
roa-efactura | 5003→5000 | ⚠️ unhealthy | ANAF E-Factura
pdf-qr-app | 5002→5000 | healthy | Invoice QR codes
docker-flask_app-1 | 5001→5000 | | ROA Flask

Troubleshooting:

# Container logs
ssh echo@10.0.20.201 "sudo pct exec 100 -- docker logs <container> --tail 50"
# Restart a container
ssh echo@10.0.20.201 "sudo pct exec 100 -- docker restart <container>"
# Status of all containers
ssh echo@10.0.20.201 "sudo pct exec 100 -- docker ps -a"
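
For roa-efactura (flagged unhealthy above), the last health-check results can be read directly; a sketch:

# Last health-check results for the unhealthy container
ssh echo@10.0.20.201 "sudo pct exec 100 -- docker inspect --format '{{json .State.Health}}' roa-efactura"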

LXC 101 — minecraft (pveelite)

  • IP: 10.0.20.162 | OS: Debian/systemd | Tailscale: no
  • Resources: 8GB RAM (3.8GB used) | 100GB disk (49GB used, 49%)

Services:

Service | Port | Description
crafty | 8443 | Crafty4 web panel (Python)
minecraft | 25565 | Minecraft server (Java)
playit | | Public tunnel for Minecraft

Troubleshooting:

ssh echo@10.0.20.202 "sudo pct exec 101 -- systemctl status crafty"
ssh echo@10.0.20.202 "sudo pct exec 101 -- journalctl -u crafty -n 50"
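
If Crafty is up but players cannot connect, checking the listening ports inside the container narrows it down (a sketch; assumes ss is installed in the container):

# Are the Crafty (8443) and Minecraft (25565) ports listening?
ssh echo@10.0.20.202 "sudo pct exec 101 -- ss -tlnp"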

LXC 103 — dokploy (pvemini)

  • IP: 10.0.20.167 | OS: Debian/systemd | Tailscale: yes
  • Resources: 4GB RAM (1.1GB used) | 50GB disk (5.8GB used, 12%)
  • Dokploy UI: http://10.0.20.167:3000

Docker containers (managed by Dokploy + Traefik):

Container | Port | Status | Description
dokploy-traefik | 80, 443 | | Reverse proxy
dokploy | 3000 | healthy | Deployment platform
dokploy-postgres | 5432 (internal) | | Dokploy DB
dokploy-redis | 6379 (internal) | | Dokploy cache
utile-icongenerator | 80 (internal) | | Icon generator
qr-qrgenerator | 80 (internal) | | QR generator
qr-pdfqrapp | | | PDF+QR app
constanta-space-booking-backend | 8000 (internal) | | Space booking API

Troubleshooting:

ssh echo@10.0.20.201 "sudo pct exec 103 -- docker ps -a"
ssh echo@10.0.20.201 "sudo pct exec 103 -- docker logs <container> --tail 50"
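
When an app deployed behind Traefik stops responding, the Traefik log is usually the fastest clue; a sketch:

# Traefik (reverse proxy) logs
ssh echo@10.0.20.201 "sudo pct exec 103 -- docker logs dokploy-traefik --tail 50"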

LXC 104 — flowise (pvemini)

  • IP: 10.0.20.161 | OS: Debian/systemd | Tailscale: yes
  • Resources: 8GB RAM (418MB used) | 100GB disk (23GB used, 23%)
  • Direct SSH: not available for user echo; use pct exec instead

Services:

Service | Port | Status | Description
ollama | 127.0.0.1:11434 | | Local LLM (CPU-only, avx2)
flowise | 3000 | | AI flow builder
ngrok | | | Public tunnel

Ollama — available models:

  • all-minilm:latest — fast embeddings ← used by echo-core memory_search
  • nomic-embed-text:latest — higher-quality embeddings
  • llama3.2:3b-instruct-q8_0 — conversational LLM
  • llama3.2:3b, llama3.2:1b — general-purpose LLMs
  • smollm:135m — small, fast LLM

Important notes:

  • Models are stored in /usr/share/ollama/.ollama/models/ (owned by user ollama)
  • The ollama service runs as user ollama, with HOME=/usr/share/ollama
  • CPU-only — no GPU; no CUDA/ROCm

Ollama troubleshooting:

# Status
ssh echo@10.0.20.201 "sudo pct exec 104 -- systemctl status ollama"
# Logs (common issue: $HOME undefined or wrong permissions)
ssh echo@10.0.20.201 "sudo pct exec 104 -- journalctl -u ollama -n 30"
# Fix permissions (if ollama will not start)
ssh echo@10.0.20.201 "sudo pct exec 104 -- chown -R ollama:ollama /usr/share/ollama/.ollama/"
# Test the API
ssh echo@10.0.20.201 "sudo pct exec 104 -- curl -s http://localhost:11434/api/tags"
# Pull a model
ssh echo@10.0.20.201 "sudo pct exec 104 -- ollama pull all-minilm"
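
To confirm the embedding model used by echo-core memory_search actually answers, a quick API call (a sketch; run it from a shell inside the container to avoid nested quoting):

# From inside LXC 104 (e.g. after: pct exec 104 -- bash)
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "all-minilm", "prompt": "test embedding"}'
# Expected: a JSON object with an "embedding" array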

LXC 106 — gitea (pvemini)

  • IP: 10.0.20.165 | OS: Alpine Linux + OpenRC ⚠️ (no systemd, no bash!)
  • Resources: 250GB disk (1.1GB used, 0%)
  • Gitea web: http://10.0.20.165:3000 (or gitea.romfast.ro)
  • Gitea SSH: port 222

Alpine specifics:

  • Shell: sh (not bash) — pct exec 106 -- sh
  • Init: OpenRC (not systemd) — rc-status, not systemctl
  • Gitea runs via Docker + s6, not natively

OpenRC services:

Service | Status | Description
networking | | Networking
tailscale | | VPN
crond | | Cron
tailscale-gitea | CRASHED | Custom Tailscale script — needs investigation

Troubleshooting:

# Access (use sh, not bash!)
ssh echo@10.0.20.201 "sudo pct exec 106 -- sh -c 'rc-status'"
ssh echo@10.0.20.201 "sudo pct exec 106 -- sh -c 'docker ps'"
# tailscale-gitea logs
ssh echo@10.0.20.201 "sudo pct exec 106 -- sh -c 'cat /var/log/tailscale-gitea.log 2>/dev/null || rc-service tailscale-gitea status'"
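
If the crashed tailscale-gitea service needs a kick, an OpenRC restart looks like this (a service restart counts as a safe fix per the policy at the end of this doc):

# Restart the crashed OpenRC service
ssh echo@10.0.20.201 "sudo pct exec 106 -- sh -c 'rc-service tailscale-gitea restart'"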

LXC 108 — central-oracle (pvemini)

  • IP: 10.0.20.121 | OS: Debian/systemd | Tailscale: no
  • Resources: 8GB RAM (4.2GB used) | 50GB disk (15GB used, 29%)

Docker containers:

Container | Port | Status | Description
oracle-xe | 1521, 5500 (EM Express) | healthy | Main Oracle XE
oracle18-xe | 1522→1521, 5502→5500 | | Oracle 18 XE
portainer | 9000, 9443, 8000 | | Local management

Oracle troubleshooting:

# Status
ssh echo@10.0.20.201 "sudo pct exec 108 -- docker ps -a"
# Oracle logs
ssh echo@10.0.20.201 "sudo pct exec 108 -- docker logs oracle-xe --tail 50"
# Open a shell inside the Oracle container
ssh echo@10.0.20.201 "sudo pct exec 108 -- docker exec -it oracle-xe bash"
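
A quick listener check inside the main Oracle container (a sketch; assumes lsnrctl is on the PATH in the oracle-xe image):

# Oracle listener status
ssh echo@10.0.20.201 "sudo pct exec 108 -- docker exec oracle-xe lsnrctl status"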

LXC 110 — moltbot (pveelite)

  • IP: 10.0.20.173 | Tailscale IP: 100.120.119.70 | OS: Debian/systemd | Tailscale: yes
  • Resources: 4GB RAM | 8GB disk (local-zfs) | 2 cores
  • Direct SSH: ssh moltbot@10.0.20.173 (dedicated non-root user)
  • This is the LXC that runs echo-core (OpenClaw)

Services:

Service | Port | Description
code-server@moltbot | 8080 | VS Code in the browser
ttyd | 7681 | Web terminal
echo-core dashboard | 8088 | Echo Task Board
whatsapp-bridge | 8098 | Baileys bridge (Node.js)
fail2ban | | SSH protection
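
Since this LXC has direct SSH with a dedicated user, a quick health check of the systemd-managed services does not need to go through the Proxmox host; a sketch:

# One-line check of the key services
ssh moltbot@10.0.20.173 "systemctl is-active code-server@moltbot ttyd fail2ban"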

LXC 171 — claude-agent (pvemini)

  • IP: 10.0.20.171 | Tailscale: 100.95.55.51 | OS: Ubuntu 24.04 LTS/systemd
  • Resources: 4GB RAM | 32GB disk (local-zfs) | 2 cores
  • Main user: claude | Workspace: /workspace/

Services:

Service | Port | Description
code-server@claude | 8080 | VS Code (user: claude)
ttyd | 7681 | Web terminal (/workspace/start-agent.sh, auth: claude:claude2025)

Claude Code:

  • Installed and configured, Git → gitea.romfast.ro
  • Programmatic mode: claude -p "task" run from the project directory (see the sketch below)
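
A sketch of the programmatic mode from inside LXC 171, as user claude (the project name and the prompt are illustrative):

# From a shell inside LXC 171, as user claude
cd /workspace/roa2web
claude -p "summarize the open TODOs in this repo"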

Projects in /workspace/ → full details in kb/tools/claude-agent-projects.md

Project | Stack | Purpose
roa2web | FastAPI + Vue.js + Oracle | Modern ROA web ERP
roaauto | Vue 3 + wa-sqlite + FastAPI | Car-service PWA (offline-first)
vfp_roaauto | Visual FoxPro (legacy) | VFP version of ROA AUTO
romfastsql | Docs + SQL + Python | Infrastructure + Oracle migration
gomag-vending | FastAPI + Oracle PL/SQL | GoMag order import → ROA
space-booking | FastAPI + SQLite + Vue | Multi-tenant desk booking
service-auto | Vue 3 + Vite + Tailwind 4 | Car-service PWA (new version)
atm | Python 3.11+ | Automated Trading Monitor (M2D)
paula-escape | HTML | Escape room game

Troubleshooting:

ssh echo@10.0.20.201 "sudo pct exec 171 -- systemctl status code-server@claude ttyd"
ssh echo@10.0.20.201 "sudo pct exec 171 -- df -h /"

VM 201 — roacentral (pvemini)

  • VMID: 201 | Host: pvemini | Status: running (autostart)
  • OS: Windows 11 Pro (24H2) | QEMU Guest Agent: yes
  • Resources: 2 cores | 4GB RAM | 500GB disk (local-zfs, ~89GB used)
  • Network: virtio bridge (DHCP) | RDP: port 3389

Main role — IIS reverse proxy:

Domain | Destination
roa.romfast.ro | ROA application
gitea.romfast.ro | LXC 106
dokploy.romfast.ro | LXC 103 Traefik
roa-qr.romfast.ro | LXC 103 Traefik
*.roa.romfast.ro | Dokploy wildcard

Installed services:

  • IIS 10.0 — ASP.NET 4.8, WebSockets, URL Rewrite, SSL termination
  • Win-ACME v2.2.9 — automated Let's Encrypt certificates
  • Oracle Instant Client — JDBC client for LXC 108
  • WinNUT — UPS monitoring (NUT server: 10.0.20.201:3493)

Backup & Replication:

  • Daily backup at 02:00 (zstd compressed)
  • ZFS replication active: pvemini → pve1 + pveelite (30-minute interval)
  • HA disabled — manual start on failover
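
Two quick checks from pvemini: the guest agent (qm is on the sudo allow-list) and the ZFS replication jobs (pvesr may need to be added to sudoers, since only qm/pct/pvesh are listed below):

# QEMU guest agent reachable?
ssh echo@10.0.20.201 "sudo qm agent 201 ping"
# Replication job status (pvemini → pve1 / pveelite)
ssh echo@10.0.20.201 "sudo pvesr status"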

VM 109 — oracle-dr (pveelite)

  • VMID: 109 | Host: pveelite | Status: stopped (started only for DR/testing)
  • IP: 10.0.20.37 | OS: Windows Server + Oracle 19c
  • HA: no longer HA-managed (removed 2026-04-20 after an OOM loop; formerly group ha-prefer-pveelite, state=stopped, nofailback=1)
  • Purpose: Disaster Recovery for the Oracle Database (RMAN backups from the external Windows server)

Oracle Database:

  • DB name: ROA | Size: ~80 GB | Tables: 42,625
  • Strategy: daily full backup (6-7 GB) + cumulative incrementals (200-300 MB)

RMAN backup schedule:

Time | Type
02:30 | Full backup
13:00 | Cumulative incremental
18:00 | Cumulative incremental
09:00 | Automated monitoring

Troubleshooting:

ssh echo@10.0.20.202 "sudo qm status 109"

VM 302 — oracle-test (pvemini)

  • VMID: 302 | Host: pvemini | Status: stopped (on-demand testing)
  • IP: 10.0.20.130 | OS: Windows 11
  • Resources: 4GB RAM | 500GB disk
  • Purpose: test environment for the ROA install scripts on Windows with Oracle 21c XE

Oracle Configuration:

  • Edition: Oracle 21c XE (CDB/PDB) | Port: 1521 | Service: XEPDB1
  • Setup dir: C:\roa-setup\ | DMP files: C:\DMPDIR\
  • Full install: ~8 minutes

Troubleshooting:

ssh echo@10.0.20.201 "sudo qm status 302"

External Windows server — production

Machine | IP | Port | Role
Production Oracle | 10.0.20.36 | 1521 | Oracle 10g on Windows, main ROA database

Proxmox nodes

Version: Proxmox VE 8.4.14 | Cluster: romfast (3 nodes, quorum active). User: echo | SSH access: ssh echo@<IP> | Sudo: qm, pct, pvesh
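
Quorum and node membership can be read through pvesh (which is on the sudo allow-list); a sketch:

# Cluster/quorum status via the API
ssh echo@10.0.20.201 "sudo pvesh get /cluster/status"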

Cluster storage:

Storage | Type | Capacity | Purpose
local-zfs | ZFS pool | 1.75 TiB | VM/LXC disks
backup | Directory | 1.79 TiB | Backups (pvemini only)
local | Directory | 1.51 TiB | ISOs and templates

pvemini (10.0.20.201) — primary host

  • Resources: 64GB RAM, 1.4TB disk
  • LXCs: 100 (running), 103 (running), 104 (running), 105 (stopped), 106 (running), 108 (running), 171 (running)
  • VMs: 201 (running), 300 (stopped — Windows 11 template), 302 (stopped — Oracle test)
  • Daily backup at 02:00: LXC 100, 104, 106, 108, VM 201 → "backup" storage

Scripts in /opt/scripts/:

  • ha-monitor.sh — daily at 00:00, HA cluster status
  • monitor-ssl-certificates.sh — checks SSL certificates
  • ups-shutdown-cluster.sh — orchestrated shutdown when the UPS goes critical
  • ups-monthly-test.sh — 1st of the month, UPS battery test
  • ups-maintenance-shutdown.sh — UPS maintenance shutdown
  • vm107-monitor.sh — monitors VM 107

pveelite (10.0.20.202)

  • Resources: 16GB RAM, 557GB disk (+ 8GB ZFS swap — added 2026-04-20 to prevent OOM)
  • LXCs: 101 (running), 105 (stopped), 110 (running), 301 (stopped)
  • VMs: 109 (stopped — Oracle DR)
  • Daily backup at 22:00: LXC 101, 110 → backup-pvemini-nfs

Scripts in /opt/scripts/:

  • oracle-backup-monitor-proxmox.sh — daily at 21:00, checks the Oracle backup
  • weekly-dr-test-proxmox.sh — Saturday 06:00, Oracle DR restore test (VM 109)

pve1 (10.0.20.200)

  • Resources: 32GB RAM, 1.3TB disk
  • Status: empty (no active VM/LXC)

Local LLM/AI services

Service | LXC | IP:Port | Notes
Ollama | 104 flowise | 10.0.20.161:11434 | CPU-only; models: all-minilm, nomic-embed-text, llama3.2
Flowise | 104 flowise | 10.0.20.161:3000 | AI flow builder

High Availability (HA)

HA groups:

ha-group-main  → pvemini (100), pveelite (50), pve1 (33)
ha-group-elite → pveelite (100), pve1 (33), pvemini (50)

Active HA resources:

Resource | Group | Max restart | Max relocate | Note
ct:100 portainer | ha-group-main | 3 | 3 |
ct:101 minecraft | ha-group-elite | 3 | 3 | Runs on pveelite
ct:104 flowise | ha-group-main | 3 | 2 | Limits added 2026-04-20
ct:106 gitea | ha-group-main | 3 | 3 |
ct:108 central-oracle | ha-group-main | 3 | 2 | Limits added 2026-04-20

VM 109 is NO longer under HA — removed 2026-04-20 after an OOM loop. It is started manually only (weekly DR test, Saturday 06:00).

# Check HA status
ssh echo@10.0.20.201 "sudo ha-manager status"
# Adjust limits (example)
ssh echo@10.0.20.201 "sudo ha-manager set ct:108 --max_restart 3 --max_relocate 2"
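
The group definitions and resource assignments can also be listed through the API; a sketch:

# HA group configuration
ssh echo@10.0.20.201 "sudo pvesh get /cluster/ha/groups"
# HA resources and their groups
ssh echo@10.0.20.201 "sudo pvesh get /cluster/ha/resources"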

Corosync Tuning (post-incident 2026-04-20)

Token raised to 10000 ms (default: 1000 ms) — this tolerates a short USB disconnect on pveelite without a forced node reboot.

# Check
ssh echo@10.0.20.201 "sudo corosync-cmapctl | grep 'totem.token '"
# runtime.config.totem.token (u32) = 10650
# totem.token (u32) = 10000
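
For reference, the usual way such a change is made (a sketch, not a record of the exact edit): set token in the totem section of /etc/pve/corosync.conf and bump config_version so the change propagates cluster-wide.

# /etc/pve/corosync.conf (excerpt, illustrative)
# totem {
#   ...
#   token: 10000
#   config_version: <n+1>   # must be incremented for the change to apply
# }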

Diagnostic tools (installed 2026-04-20)

rasdaemon — MCE + PCIe AER monitoring

ssh echo@10.0.20.201 "sudo ras-mc-ctl --summary"

netconsole — kernel logs → pve1

If pvemini crashes hard, the last kernel log lines can be found on pve1:

ssh echo@10.0.20.200 "sudo tail /var/log/netconsole-pvemini.log"
ssh echo@10.0.20.200 "sudo systemctl status netconsole-receiver"

kdump-tools — crash dump capture

ssh echo@10.0.20.201 "sudo systemctl is-active kdump-tools"
# Crash dumps land in /var/crash/ on pvemini

kernel.panic auto-reboot

ssh echo@10.0.20.201 "sudo sysctl kernel.panic"
# kernel.panic = 10 → auto-reboot 10 s after a kernel panic

OOM Alerting

The /opt/scripts/oom-alert.sh script runs on all 3 nodes — via cron every minute — and emails mmarius28@gmail.com when it detects an OOM kill.

# Check that the script is installed on all nodes
for ip in 10.0.20.200 10.0.20.201 10.0.20.202; do
    ssh echo@$ip "sudo crontab -l | grep oom-alert"
done
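
A manual spot-check for recent OOM kills on a node (sudo journalctl is already used elsewhere in this doc):

# Recent OOM kills in the kernel log (example: pveelite)
ssh echo@10.0.20.202 "sudo journalctl -k --since '24 hours ago' | grep -i 'out of memory'"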

Mail Notifications (Proxmox → mail.romfast.ro)

All 3 nodes send mail through mail.romfast.ro:465 as ups@romfast.ro.

# Quick test
ssh echo@10.0.20.201 "echo 'test' | sudo mail -r 'ups@romfast.ro' -s 'test pvemini' mmarius28@gmail.com"
ssh echo@10.0.20.201 "sudo journalctl -u 'postfix@-' --since '1 min ago' | grep status="
# Expected: status=sent (250 OK ...)

Swap on pveelite (8GB ZFS zvol)

Added 2026-04-20 to prevent OOM (pveelite has only 16GB RAM).

ssh echo@10.0.20.202 "sudo swapon --show; sudo sysctl vm.swappiness"
# swappiness: 10 (swap only under real memory pressure)
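
For reference, the usual way an 8GB ZFS zvol swap is set up (a sketch with an assumed dataset name, not necessarily the exact commands used here):

# Illustrative only; the dataset name rpool/swap is an assumption
# zfs create -V 8G -b $(getconf PAGESIZE) -o compression=off rpool/swap
# mkswap /dev/zvol/rpool/swap
# swapon /dev/zvol/rpool/swap
# add an entry to /etc/fstab so it persists across reboots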

Alert automatically when

  • A container/VM goes down unexpectedly
  • Disk usage is >85% on any container/VM
  • A service stays unhealthy for >1h
  • Repeated errors in the logs

Act on my own (without asking)

  • Monitoring and reading status
  • Diagnostics: logs, configurations, health checks
  • Safe fixes: permissions, service restarts

Ask first

  • Starting/stopping a VM or LXC
  • Configuration changes (network, storage, resources)
  • Any destructive operation