1 Commits

Author SHA1 Message Date
Claude Agent
62e9926bd4 feat(dr): add cluster + memory pre-flight, deploy VM 109 watchdog
DR test script now refuses to start VM 109 if:
  * cluster is not quorate (e.g. mid-failover into a degraded state),
  * available memory on the host is below VM 109 config + 1 GB margin.

Both checks scale automatically — memory threshold is computed from
qm config so resizing VM 109 does not require touching the script.

Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The
watchdog is the second line of defence behind the cleanup trap from
8a0c557: it force-stops VM 109 if the trap was bypassed (script
killed, host crash mid-test, manual run forgotten). It honours
/var/run/vm109-debug.flag for legitimate manual sessions and is
node-aware via /etc/pve/qemu-server/109.conf so it can be deployed
on every node without coordinating with VM 109's current location.

Both safeguards target the 04-18 → 04-20 chain: VM 109 left running
2.5 days then sandwiched against an HA failover that pushed CT 108
Oracle (8 GB) onto pveelite (16 GB) → OOM cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:48:12 +00:00