feat(dr): manual failover/failback scripts + pveelite-down email alert

failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote
the rpool/oracle-backups dataset between nodes when pveelite is down.
Both refuse to run if the other side is reachable to prevent split-brain.
Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via
SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201.

The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash
caller does not need to escape PowerShell quoting. End-to-end test
including failover -> failback confirmed transfer_backups.ps1 returns
to byte-identical state (SHA256 43DD2187...).

pveelite-down-alert.sh runs every minute on pvemini and emails an alert
with copy-paste failover instructions after 5 consecutive ping failures.
The alert body includes the latest oracle-backups and VM 109 replica
timestamps so the operator knows the recovery point before deciding.

The DR weekly-test script gains a cluster-aware guard at the top that
exits silently when /etc/pve/qemu-server/109.conf is not on the local
node, allowing the same cron entry to be present on both pveelite and
pvemini without double-firing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Claude Agent
2026-04-25 19:18:51 +00:00
parent a62bcb4331
commit b7fffe1467
4 changed files with 380 additions and 0 deletions

View File

@@ -44,6 +44,14 @@ export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Configuration
DR_VM_ID="109"
# Cluster-aware exit: only the node currently hosting VM 109 should run the
# test. With cron deployed on both pveelite (normal home) and pvemini (DR
# failover home), this guard ensures only one instance fires.
if [ ! -f "/etc/pve/qemu-server/${DR_VM_ID}.conf" ] && [ "${1:-}" != "--install" ] && [ "${1:-}" != "--help" ]; then
exit 0
fi
DR_VM_IP="10.0.20.37"
DR_VM_PORT="22122"
DR_VM_USER="romfast"