diff --git a/memory/kb/tools/infrastructure.md b/memory/kb/tools/infrastructure.md
index 43b0c7b..b771cf1 100644
--- a/memory/kb/tools/infrastructure.md
+++ b/memory/kb/tools/infrastructure.md
@@ -1,6 +1,8 @@
 # Infrastructure (Proxmox + Docker)

-> Last updated: 2026-04-25. Synced with romfastsql/proxmox/ from Gitea.
+> Last updated: 2026-04-26. Synced with romfastsql/proxmox/ from Gitea.
+> Repo cloned locally: `/home/moltbot/workspace/romfastsql/` (HTTPS, no SSH key)
+> Detailed documentation per LXC/VM: `romfastsql/proxmox//README.md`

 ## Quick LXC access

@@ -13,7 +15,7 @@
 | 106 | gitea | pvemini | 10.0.20.165 | - | `ssh echo@10.0.20.201 "sudo pct exec 106 -- sh"` ⚠️ Alpine (sh, not bash) |
 | 108 | central-oracle | pvemini | 10.0.20.121 | `ssh echo@10.0.20.121` | `ssh echo@10.0.20.201 "sudo pct exec 108 -- bash"` |
 | 110 | moltbot | pveelite | 10.0.20.173 | `ssh moltbot@10.0.20.173` | `ssh echo@10.0.20.202 "sudo pct exec 110 -- bash"` |
-| 171 | claude-agent | pveelite | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.202 "sudo pct exec 171 -- bash"` |
+| 171 | claude-agent | pvemini ⚠️ | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.201 "sudo pct exec 171 -- bash"` |

 ---

@@ -343,7 +345,7 @@ ssh echo@10.0.20.201 "sudo qm status 302"

 ### pvemini (10.0.20.201) - main host
 - **Resources:** 64GB RAM, 1.4TB disk
-- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running)
+- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running), 171(running)
 - **VMs:** 201(running), 300(stopped, Windows 11 template), 302(stopped, oracle test)
 - **Daily backup 02:00:** LXC 100, 104, 106, 108, VM 201 → storage "backup"

@@ -356,10 +358,10 @@ ssh echo@10.0.20.201 "sudo qm status 302"
 - `vm107-monitor.sh`: monitoring for VM 107

 ### pveelite (10.0.20.202)
-- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap)
-- **LXCs:** 101(running), 105(stopped), 110(running), 171(running), 301(stopped)
+- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap, added 2026-04-20 as an anti-OOM measure)
+- **LXCs:** 101(running), 105(stopped), 110(running), 301(stopped)
 - **VMs:** 109(stopped, oracle DR)
-- **Daily backup 22:00:** LXC 101, 110, 171 → backup-pvemini-nfs
+- **Daily backup 22:00:** LXC 101, 110 → backup-pvemini-nfs

 **Scripts `/opt/scripts/`:**
 - `oracle-backup-monitor-proxmox.sh`: daily at 21:00, checks the Oracle backup
@@ -380,6 +382,112 @@ ssh echo@10.0.20.201 "sudo qm status 302"

 ---

+## High Availability (HA)
+
+**HA groups:**
+```
+ha-group-main → pvemini (100), pveelite (50), pve1 (33)
+ha-group-elite → pveelite (100), pve1 (33), pvemini (50)
+```
+
+**Active HA resources:**
+| Resource | Group | Max restart | Max relocate | Note |
+|----------|-------|-------------|--------------|------|
+| ct:100 portainer | ha-group-main | 3 | 3 | |
+| ct:101 minecraft | ha-group-elite | 3 | 3 | Runs on pveelite |
+| ct:104 flowise | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
+| ct:106 gitea | ha-group-main | 3 | 3 | |
+| ct:108 central-oracle | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
+
+**VM 109 is NO longer in HA**: removed 2026-04-20 after an OOM loop. It is started manually only (weekly DR test, Saturday 06:00).
+
+```bash
+# Check HA status
+ssh echo@10.0.20.201 "sudo ha-manager status"
+# Change limits (example)
+ssh echo@10.0.20.201 "sudo ha-manager set ct:108 --max_restart 3 --max_relocate 2"
+```
+
+---
+
+## Corosync Tuning (post-incident 2026-04-20)
+
+Token raised to 10000ms (default: 1000ms), which tolerates a brief USB disconnect on pveelite without a forced reboot.
+
+```bash
+# Check
+ssh echo@10.0.20.201 "sudo corosync-cmapctl | grep 'totem.token '"
+# runtime.config.totem.token (u32) = 10650
+# totem.token (u32) = 10000
+```
+
+---
+
+## Diagnostic Tools (installed 2026-04-20)
+
+### rasdaemon: MCE + PCIe AER monitoring
+```bash
+ssh echo@10.0.20.201 "sudo ras-mc-ctl --summary"
+```
+
+### netconsole: kernel logs → pve1
+If pvemini crashes hard, the last kernel log lines can be found on pve1:
+```bash
+ssh echo@10.0.20.200 "sudo tail /var/log/netconsole-pvemini.log"
+ssh echo@10.0.20.200 "sudo systemctl status netconsole-receiver"
+```
+
+### kdump-tools: crash dump capture
+```bash
+ssh echo@10.0.20.201 "sudo systemctl is-active kdump-tools"
+# Crash dumps are written to /var/crash/ on pvemini
+```
+
+### kernel.panic auto-reboot
+```bash
+ssh echo@10.0.20.201 "sudo sysctl kernel.panic"
+# kernel.panic = 10 → auto-reboot 10s after a kernel panic
+```
+
+---
+
+## OOM Alerting
+
+Script `/opt/scripts/oom-alert.sh` on all 3 nodes, run by cron every minute, sends mail to mmarius28@gmail.com when it detects an OOM kill.
+
+```bash
+# Check that it is installed on all nodes
+for ip in 10.0.20.200 10.0.20.201 10.0.20.202; do
+  ssh echo@$ip "sudo crontab -l | grep oom-alert"
+done
+```
+
+---
+
+## Mail Notifications (Proxmox → mail.romfast.ro)
+
+All 3 nodes send through `mail.romfast.ro:465` as `ups@romfast.ro`.
+
+```bash
+# Quick test
+ssh echo@10.0.20.201 "echo 'test' | sudo mail -r 'ups@romfast.ro' -s 'test pvemini' mmarius28@gmail.com"
+ssh echo@10.0.20.201 "sudo journalctl -u 'postfix@-' --since '1 min ago' | grep status="
+# Expected: status=sent (250 OK ...)
+```
+
+---
+
+## Swap on pveelite (8GB ZFS zvol)
+
+Added 2026-04-20 as an anti-OOM measure (pveelite has 16GB RAM).
+
+```bash
+ssh echo@10.0.20.202 "sudo swapon --show; sudo sysctl vm.swappiness"
+# swappiness: 10 (swap only under real memory pressure)
+```
+
+---
+
 ## Automatic alert when

 - Container/VM down unexpectedly