Commit Graph

2 Commits

Author SHA1 Message Date
Marius
3ded5d3f2f docs(cluster): incident pvemini backup SSD hang + thermal monitoring
Documenteaza incidentul Kingston SNV3S2000G hang la 2026-04-30 (Sensor 2
74°C → emergency mode + restart loop) si masurile aplicate: distantare
temporala backup-uri par/impar, mutare CT 101+110 pe pve1 backup-ssd,
nofail in fstab, hardware watchdog iTCO_wdt, monitoring CSV la 30 min.

Adauga scripturile /opt/scripts/kingston-thermal-{monitor,report}.sh
pentru tracking trend si alertare la depasirea pragurilor termale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 01:04:02 +03:00
Claude Agent
60c27e7232 fix(vm109-dr): trap cleanup to stop VM 109 on script exit
The DR test script used set -euo pipefail, so a failing SSH
shutdown command caused the script to exit before qm stop.
On 2026-04-20 this left VM 109 running for 2.5 days and
triggered an OOM cascade when pvemini HA-failed over to
pveelite.

Adds EXIT trap that force-stops VM 109 regardless of exit
path, and makes the Step 7 SSH shutdown tolerant of failure.
Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 11:16:04 +00:00