docs(kb): update infrastructure with HA, corosync tuning, OOM alerting

- Clone romfastsql repo locally to /home/moltbot/workspace/romfastsql/
- Fix: LXC 171 is on pvemini, not pveelite
- Add missing sections: HA groups, corosync token tuning (post-incident 2026-04-20)
- Diagnostic tools: rasdaemon, netconsole, kdump-tools
- OOM alerting, mail notifications, swap on pveelite

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-26 12:06:29 +00:00
parent e4674b5dda
commit bee409d164


@@ -1,6 +1,8 @@
# Infrastructure (Proxmox + Docker)
> Last updated: 2026-04-25. Synced with romfastsql/proxmox/ from Gitea.
> Last updated: 2026-04-26. Synced with romfastsql/proxmox/ from Gitea.
> Repo cloned locally: `/home/moltbot/workspace/romfastsql/` (HTTPS, no SSH key)
> Detailed documentation per LXC/VM: `romfastsql/proxmox/<component>/README.md`
## Quick LXC access
@@ -13,7 +15,7 @@
| 106 | gitea | pvemini | 10.0.20.165 | — | `ssh echo@10.0.20.201 "sudo pct exec 106 -- sh"` ⚠️ Alpine (sh, not bash) |
| 108 | central-oracle | pvemini | 10.0.20.121 | `ssh echo@10.0.20.121` | `ssh echo@10.0.20.201 "sudo pct exec 108 -- bash"` |
| 110 | moltbot | pveelite | 10.0.20.173 | `ssh moltbot@10.0.20.173` | `ssh echo@10.0.20.202 "sudo pct exec 110 -- bash"` |
| 171 | claude-agent | pveelite | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.202 "sudo pct exec 171 -- bash"` |
| 171 | claude-agent | pvemini ⚠️ | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.201 "sudo pct exec 171 -- bash"` |
---
@@ -343,7 +345,7 @@ ssh echo@10.0.20.201 "sudo qm status 302"
### pvemini (10.0.20.201) — main host
- **Resources:** 64GB RAM, 1.4TB disk
- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running)
- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running), 171(running)
- **VMs:** 201(running), 300(stopped — Windows 11 template), 302(stopped — oracle test)
- **Daily backup 02:00:** LXC 100, 104, 106, 108, VM 201 → storage "backup"
@@ -356,10 +358,10 @@ ssh echo@10.0.20.201 "sudo qm status 302"
- `vm107-monitor.sh` — VM 107 monitoring
### pveelite (10.0.20.202)
- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap)
- **LXCs:** 101(running), 105(stopped), 110(running), 171(running), 301(stopped)
- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap — added 2026-04-20 anti-OOM)
- **LXCs:** 101(running), 105(stopped), 110(running), 301(stopped)
- **VMs:** 109(stopped — oracle DR)
- **Daily backup 22:00:** LXC 101, 110, 171 → backup-pvemini-nfs
- **Daily backup 22:00:** LXC 101, 110 → backup-pvemini-nfs
**Scripts in `/opt/scripts/`:**
- `oracle-backup-monitor-proxmox.sh` — daily 21:00, checks the Oracle backup
@@ -380,6 +382,112 @@ ssh echo@10.0.20.201 "sudo qm status 302"
---
## High Availability (HA)
**HA groups:**
```
ha-group-main → pvemini (100), pveelite (50), pve1 (33)
ha-group-elite → pveelite (100), pve1 (33), pvemini (50)
```
**Active HA resources:**
| Resource | Group | Max restart | Max relocate | Note |
|----------|-------|-------------|--------------|------|
| ct:100 portainer | ha-group-main | 3 | 3 | |
| ct:101 minecraft | ha-group-elite | 3 | 3 | Runs on pveelite |
| ct:104 flowise | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
| ct:106 gitea | ha-group-main | 3 | 3 | |
| ct:108 central-oracle | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
**VM 109 is NO longer under HA** — removed 2026-04-20 after an OOM loop. Started manually only (weekly DR test, Saturdays 06:00).
```bash
# Check HA status
ssh echo@10.0.20.201 "sudo ha-manager status"
# Change limits (example)
ssh echo@10.0.20.201 "sudo ha-manager set ct:108 --max_restart 3 --max_relocate 2"
```
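Since VM 109 is out of HA, the Saturday DR test comes down to a manual start and stop. A sketch using the standard `qm` commands (the shutdown step is an assumption about the test procedure, not documented here):

```bash
# Manual DR test for VM 109 (out of HA since 2026-04-20)
ssh echo@10.0.20.202 "sudo qm start 109"
ssh echo@10.0.20.202 "sudo qm status 109"
# run the DR checks, then stop the VM again
ssh echo@10.0.20.202 "sudo qm shutdown 109"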
---
## Corosync Tuning (post-incident 2026-04-20)
Token raised to 10000ms (default: 1000ms) — tolerates a short USB disconnect on pveelite without a forced reboot.
```bash
# Check
ssh echo@10.0.20.201 "sudo corosync-cmapctl | grep 'totem.token '"
# runtime.config.totem.token (u32) = 10650
# totem.token (u32) = 10000
```
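The setting lives in the `totem` section of `corosync.conf`; on Proxmox the canonical copy is `/etc/pve/corosync.conf`, and `config_version` must be bumped when editing so the cluster filesystem propagates the change. A sketch of the stanza:

```
totem {
  version: 2
  token: 10000
  # remaining totem keys (cluster_name, interface, etc.) unchanged
}
```

The runtime value of 10650 seen above is expected: corosync adds `token_coefficient` (default 650ms) for each node beyond two, so a 3-node cluster runs with 10000 + 650.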
---
## Diagnostic Tools (instalate 2026-04-20)
### rasdaemon — MCE + PCIe AER monitoring
```bash
ssh echo@10.0.20.201 "sudo ras-mc-ctl --summary"
```
### netconsole — kernel logs → pve1
If pvemini crashes hard, the last kernel lines can be found on pve1:
```bash
ssh echo@10.0.20.200 "sudo tail /var/log/netconsole-pvemini.log"
ssh echo@10.0.20.200 "sudo systemctl status netconsole-receiver"
```
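On the sending side, netconsole is a kernel module taking a `src-port@src-ip/dev,dst-port@dst-ip/dst-mac` parameter. A sketch of what the pvemini side could look like (port `6666` and bridge `vmbr0` are assumptions, not taken from the repo):

```
# /etc/modules-load.d/netconsole.conf — load the module at boot
netconsole

# /etc/modprobe.d/netconsole.conf — send pvemini kernel messages to pve1
options netconsole netconsole=6666@10.0.20.201/vmbr0,6666@10.0.20.200/
```

Leaving the target MAC empty falls back to broadcast; pinning pve1's MAC is more robust across switches.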
### kdump-tools — crash dump capture
```bash
ssh echo@10.0.20.201 "sudo systemctl is-active kdump-tools"
# Dumps on crash land in /var/crash/ on pvemini
```
### kernel.panic auto-reboot
```bash
ssh echo@10.0.20.201 "sudo sysctl kernel.panic"
# kernel.panic = 10 → auto-reboot 10s after a kernel panic
```
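To survive reboots the value belongs in a sysctl drop-in rather than a one-off `sysctl -w` (the filename below is arbitrary):

```
# /etc/sysctl.d/90-kernel-panic.conf
kernel.panic = 10
```

Apply without a reboot via `sysctl --system`.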
---
## OOM Alerting
The `/opt/scripts/oom-alert.sh` script runs on all 3 nodes — cron, every minute — and mails mmarius28@gmail.com when it detects an OOM kill.
```bash
# Check it is installed on all nodes
for ip in 10.0.20.200 10.0.20.201 10.0.20.202; do
ssh echo@$ip "sudo crontab -l | grep oom-alert"
done
```
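The script body itself is not included in this excerpt. A minimal sketch of what an `oom-alert.sh` along these lines could look like (the match pattern, the 2-minute window, and the exact `mail` flags are assumptions, mirroring the test command in the Mail Notifications section):

```shell
#!/bin/bash
# Hypothetical sketch of /opt/scripts/oom-alert.sh — the deployed script may differ.
MAILTO="mmarius28@gmail.com"

# Return 0 if the given kernel-log text contains an OOM kill line.
detect_oom() {
  printf '%s\n' "$1" | grep -qE 'Out of memory|oom-kill|invoked oom-killer'
}

main() {
  # Cron runs this every minute; look slightly further back to avoid gaps.
  recent=$(journalctl -k --since '-2 min' 2>/dev/null)
  if detect_oom "$recent"; then
    echo "OOM kill detected on $(hostname) at $(date)" |
      mail -r 'ups@romfast.ro' -s "OOM alert: $(hostname)" "$MAILTO"
  fi
}

# Run only when executed directly, so the functions can be sourced for testing.
[ "${BASH_SOURCE[0]}" = "$0" ] && main || true
```

Wired up from root's crontab (e.g. `* * * * * /opt/scripts/oom-alert.sh`), which is what the verification loop above greps for.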
---
## Mail Notifications (Proxmox → mail.romfast.ro)
All 3 nodes send through `mail.romfast.ro:465` as `ups@romfast.ro`.
```bash
# Quick test
ssh echo@10.0.20.201 "echo 'test' | sudo mail -r 'ups@romfast.ro' -s 'test pvemini' mmarius28@gmail.com"
ssh echo@10.0.20.201 "sudo journalctl -u 'postfix@-' --since '1 min ago' | grep status="
# Expect: status=sent (250 OK ...)
```
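On the Postfix side, relaying through port 465 implies implicit TLS (SMTPS) plus SASL auth. A sketch of the relevant `/etc/postfix/main.cf` lines (the password-map path is an assumption; actual values on the nodes may differ):

```
relayhost = [mail.romfast.ro]:465
smtp_tls_wrappermode = yes
smtp_tls_security_level = encrypt
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
```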
---
## Swap on pveelite (8GB ZFS zvol)
Added 2026-04-20 as an anti-OOM measure (pveelite has only 16GB RAM).
```bash
ssh echo@10.0.20.202 "sudo swapon --show; sudo sysctl vm.swappiness"
# swappiness: 10 (swap only under real memory pressure)
```
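For reference, the usual OpenZFS recipe for zvol-backed swap looks like this (a sketch; the `rpool/swap` dataset name and 8K block size are assumptions, and the tuning options follow the OpenZFS FAQ to avoid deadlocks when swapping under memory pressure):

```bash
zfs create -V 8G -b 8192 \
  -o compression=zle -o logbias=throughput \
  -o sync=always -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
echo '/dev/zvol/rpool/swap none swap discard 0 0' >> /etc/fstab
```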
---
## Automatic alerting when
- Container/VM down unexpectedly