docs(proxmox): document HA, corosync tuning, diagnostic tools and mail relay

Following the 2026-04-20 cluster outage, the cluster README now covers
HA resource limits, corosync token tuning (10s tolerance for USB glitches),
rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via
mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite.

VM 109 README now clearly states it was removed from HA and is only
started by the weekly DR test script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Claude Agent
2026-04-20 11:30:46 +00:00
parent 60c27e7232
commit 1203c24d63
3 changed files with 227 additions and 2 deletions

View File

@@ -7,8 +7,10 @@ proxmox/
├── README.md # Acest fișier (index principal)
├── cluster/ # Infrastructură cluster Proxmox
│ ├── README.md # Ghid SSH și administrare cluster
│ ├── README.md # Ghid SSH, HA, corosync tuning, diagnostic tools, mail relay, OOM alert
│ ├── cluster-ha-monitor.sh # Script monitorizare HA
│ ├── incidents/ # Post-mortems incidente cluster
│ │ └── 2026-04-20-cluster-outage.md # Cascadă OOM pveelite + USB LAN watchdog
│ └── ups/ # Sistem UPS pentru cluster
│ ├── README.md
│ ├── docs/