docs(kb): update infrastructure with HA, corosync tuning, OOM alerting

- Clone romfastsql repo locally to /home/moltbot/workspace/romfastsql/
- Fix: LXC 171 is on pvemini, not pveelite
- Add missing sections: HA groups, corosync token tuning (post-incident 2026-04-20)
- Diagnostic tools: rasdaemon, netconsole, kdump-tools
- OOM alerting, mail notifications, swap on pveelite

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-04-26 12:06:29 +00:00
parent e4674b5dda
commit bee409d164


@@ -1,6 +1,8 @@
# Infrastructure (Proxmox + Docker)
> Last updated: 2026-04-25. Synced with romfastsql/proxmox/ from Gitea.
> Last updated: 2026-04-26. Synced with romfastsql/proxmox/ from Gitea.
> Repo cloned locally: `/home/moltbot/workspace/romfastsql/` (HTTPS, no SSH key)
> Detailed documentation per LXC/VM: `romfastsql/proxmox/<component>/README.md`
## Quick LXC access
@@ -13,7 +15,7 @@
| 106 | gitea | pvemini | 10.0.20.165 | — | `ssh echo@10.0.20.201 "sudo pct exec 106 -- sh"` ⚠️ Alpine (sh, not bash) |
| 108 | central-oracle | pvemini | 10.0.20.121 | `ssh echo@10.0.20.121` | `ssh echo@10.0.20.201 "sudo pct exec 108 -- bash"` |
| 110 | moltbot | pveelite | 10.0.20.173 | `ssh moltbot@10.0.20.173` | `ssh echo@10.0.20.202 "sudo pct exec 110 -- bash"` |
| 171 | claude-agent | pveelite | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.202 "sudo pct exec 171 -- bash"` |
| 171 | claude-agent | pvemini ⚠️ | 10.0.20.171 | `ssh claude@10.0.20.171` | `ssh echo@10.0.20.201 "sudo pct exec 171 -- bash"` |
---
@@ -343,7 +345,7 @@ ssh echo@10.0.20.201 "sudo qm status 302"
### pvemini (10.0.20.201) — main host
- **Resources:** 64GB RAM, 1.4TB disk
- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running)
- **LXCs:** 100(running), 103(running), 104(running), 105(stopped), 106(running), 108(running), 171(running)
- **VMs:** 201(running), 300(stopped — Windows 11 template), 302(stopped — oracle test)
- **Daily backup 02:00:** LXC 100, 104, 106, 108, VM 201 → storage "backup"
@@ -356,10 +358,10 @@ ssh echo@10.0.20.201 "sudo qm status 302"
- `vm107-monitor.sh` — VM 107 monitoring
### pveelite (10.0.20.202)
- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap)
- **LXCs:** 101(running), 105(stopped), 110(running), 171(running), 301(stopped)
- **Resources:** 16GB RAM, 557GB disk (+ 8GB ZFS swap — added 2026-04-20 anti-OOM)
- **LXCs:** 101(running), 105(stopped), 110(running), 301(stopped)
- **VMs:** 109(stopped — oracle DR)
- **Daily backup 22:00:** LXC 101, 110, 171 → backup-pvemini-nfs
- **Daily backup 22:00:** LXC 101, 110 → backup-pvemini-nfs
**Scripts in `/opt/scripts/`:**
- `oracle-backup-monitor-proxmox.sh` — daily 21:00, checks the Oracle backup
@@ -380,6 +382,112 @@ ssh echo@10.0.20.201 "sudo qm status 302"
---
## High Availability (HA)
**HA groups:**
```
ha-group-main → pvemini (100), pveelite (50), pve1 (33)
ha-group-elite → pveelite (100), pve1 (33), pvemini (50)
```
**Active HA resources:**
| Resource | Group | Max restart | Max relocate | Note |
|----------|-------|-------------|--------------|------|
| ct:100 portainer | ha-group-main | 3 | 3 | |
| ct:101 minecraft | ha-group-elite | 3 | 3 | Runs on pveelite |
| ct:104 flowise | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
| ct:106 gitea | ha-group-main | 3 | 3 | |
| ct:108 central-oracle | ha-group-main | 3 | 2 | Limits added 2026-04-20 |
**VM 109 is NO longer under HA** — removed 2026-04-20 after an OOM loop. Started manually only (weekly DR test, Saturdays 06:00).
```bash
# Check HA status
ssh echo@10.0.20.201 "sudo ha-manager status"
# Change limits (example)
ssh echo@10.0.20.201 "sudo ha-manager set ct:108 --max_restart 3 --max_relocate 2"
```
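Since VM 109 is out of HA, the Saturday DR test comes down to a manual start and stop. A sketch using the standard `qm` commands (the shutdown step is an assumption about the test procedure, not documented here):

```bash
# Manual DR test for VM 109 (out of HA since 2026-04-20)
ssh echo@10.0.20.202 "sudo qm start 109"
ssh echo@10.0.20.202 "sudo qm status 109"
# run the DR checks, then stop the VM again
ssh echo@10.0.20.202 "sudo qm shutdown 109"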
---
## Corosync Tuning (post-incident 2026-04-20)
Token raised to 10000ms (default: 1000ms) — tolerates a short USB disconnect on pveelite without a forced reboot.
```bash
# Check
ssh echo@10.0.20.201 "sudo corosync-cmapctl | grep 'totem.token '"
# runtime.config.totem.token (u32) = 10650
# totem.token (u32) = 10000
```
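The setting lives in the `totem` section of `corosync.conf`; on Proxmox the canonical copy is `/etc/pve/corosync.conf`, and `config_version` must be bumped when editing so the cluster filesystem propagates the change. A sketch of the stanza:

```
totem {
  version: 2
  token: 10000
  # remaining totem keys (cluster_name, interface, etc.) unchanged
}
```

The runtime value of 10650 seen above is expected: corosync adds `token_coefficient` (default 650ms) for each node beyond two, so a 3-node cluster runs with 10000 + 650.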
---
## Diagnostic Tools (instalate 2026-04-20)
### rasdaemon — MCE + PCIe AER monitoring
```bash
ssh echo@10.0.20.201 "sudo ras-mc-ctl --summary"
```
### netconsole — kernel logs → pve1
If pvemini crashes hard, the last kernel lines can be found on pve1:
```bash
ssh echo@10.0.20.200 "sudo tail /var/log/netconsole-pvemini.log"
ssh echo@10.0.20.200 "sudo systemctl status netconsole-receiver"
```
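On the sending side, netconsole is a kernel module taking a `src-port@src-ip/dev,dst-port@dst-ip/dst-mac` parameter. A sketch of what the pvemini side could look like (port `6666` and bridge `vmbr0` are assumptions, not taken from the repo):

```
# /etc/modules-load.d/netconsole.conf — load the module at boot
netconsole

# /etc/modprobe.d/netconsole.conf — send pvemini kernel messages to pve1
options netconsole netconsole=6666@10.0.20.201/vmbr0,6666@10.0.20.200/
```

Leaving the target MAC empty falls back to broadcast; pinning pve1's MAC is more robust across switches.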
### kdump-tools — crash dump capture
```bash
ssh echo@10.0.20.201 "sudo systemctl is-active kdump-tools"
# Dumps on crash land in /var/crash/ on pvemini
```
### kernel.panic auto-reboot
```bash
ssh echo@10.0.20.201 "sudo sysctl kernel.panic"
# kernel.panic = 10 → auto-reboot 10s after a kernel panic
```
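To survive reboots the value belongs in a sysctl drop-in rather than a one-off `sysctl -w` (the filename below is arbitrary):

```
# /etc/sysctl.d/90-kernel-panic.conf
kernel.panic = 10
```

Apply without a reboot via `sysctl --system`.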
---
## OOM Alerting
The `/opt/scripts/oom-alert.sh` script runs on all 3 nodes — cron, every minute — and mails mmarius28@gmail.com when it detects an OOM kill.
```bash
# Check it is installed on all nodes
for ip in 10.0.20.200 10.0.20.201 10.0.20.202; do
ssh echo@$ip "sudo crontab -l | grep oom-alert"
done
```
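The script body itself is not included in this excerpt. A minimal sketch of what an `oom-alert.sh` along these lines could look like (the match pattern, the 2-minute window, and the exact `mail` flags are assumptions, mirroring the test command in the Mail Notifications section):

```shell
#!/bin/bash
# Hypothetical sketch of /opt/scripts/oom-alert.sh — the deployed script may differ.
MAILTO="mmarius28@gmail.com"

# Return 0 if the given kernel-log text contains an OOM kill line.
detect_oom() {
  printf '%s\n' "$1" | grep -qE 'Out of memory|oom-kill|invoked oom-killer'
}

main() {
  # Cron runs this every minute; look slightly further back to avoid gaps.
  recent=$(journalctl -k --since '-2 min' 2>/dev/null)
  if detect_oom "$recent"; then
    echo "OOM kill detected on $(hostname) at $(date)" |
      mail -r 'ups@romfast.ro' -s "OOM alert: $(hostname)" "$MAILTO"
  fi
}

# Run only when executed directly, so the functions can be sourced for testing.
[ "${BASH_SOURCE[0]}" = "$0" ] && main || true
```

Wired up from root's crontab (e.g. `* * * * * /opt/scripts/oom-alert.sh`), which is what the verification loop above greps for.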
---
## Mail Notifications (Proxmox → mail.romfast.ro)
All 3 nodes send through `mail.romfast.ro:465` as `ups@romfast.ro`.
```bash
# Quick test
ssh echo@10.0.20.201 "echo 'test' | sudo mail -r 'ups@romfast.ro' -s 'test pvemini' mmarius28@gmail.com"
ssh echo@10.0.20.201 "sudo journalctl -u 'postfix@-' --since '1 min ago' | grep status="
# Expect: status=sent (250 OK ...)
```
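On the Postfix side, relaying through port 465 implies implicit TLS (SMTPS) plus SASL auth. A sketch of the relevant `/etc/postfix/main.cf` lines (the password-map path is an assumption; actual values on the nodes may differ):

```
relayhost = [mail.romfast.ro]:465
smtp_tls_wrappermode = yes
smtp_tls_security_level = encrypt
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
```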
---
## Swap on pveelite (8GB ZFS zvol)
Added 2026-04-20 as an anti-OOM measure (pveelite has only 16GB RAM).
```bash
ssh echo@10.0.20.202 "sudo swapon --show; sudo sysctl vm.swappiness"
# swappiness: 10 (swap only under real memory pressure)
```
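For reference, the usual OpenZFS recipe for zvol-backed swap looks like this (a sketch; the `rpool/swap` dataset name and 8K block size are assumptions, and the tuning options follow the OpenZFS FAQ to avoid deadlocks when swapping under memory pressure):

```bash
zfs create -V 8G -b 8192 \
  -o compression=zle -o logbias=throughput \
  -o sync=always -o primarycache=metadata rpool/swap
mkswap -f /dev/zvol/rpool/swap
swapon /dev/zvol/rpool/swap
echo '/dev/zvol/rpool/swap none swap discard 0 0' >> /etc/fstab
```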
---
## Automatic alerting when
- Container/VM down unexpectedly