feat(dr): manual failover/failback scripts + pveelite-down email alert

failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote the rpool/oracle-backups dataset between nodes when pveelite is down. Both refuse to run if the other side is reachable, to prevent split-brain. Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201. The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash caller does not need to escape PowerShell quoting. An end-to-end test (failover -> failback) confirmed that transfer_backups.ps1 returns to a byte-identical state (SHA256 43DD2187...).

pveelite-down-alert.sh runs every minute on pvemini and emails an alert with copy-paste failover instructions after 5 consecutive ping failures. The alert body includes the latest oracle-backups and VM 109 replica timestamps so the operator knows the recovery point before deciding.

The DR weekly-test script gains a cluster-aware guard at the top that exits silently when /etc/pve/qemu-server/109.conf is not on the local node, allowing the same cron entry to be present on both pveelite and pvemini without double-firing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
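
The -EncodedCommand step described in the commit message can be sketched as follows. This is a minimal illustration and not the code from failover-dr-to-pvemini.sh: the helper name encode_ps and the sample PowerShell one-liner are hypothetical, and it assumes iconv and GNU base64 (for -w 0) are available on the node.

```shell
#!/bin/bash
# Hypothetical helper: encode a PowerShell snippet for -EncodedCommand.
# PowerShell expects the base64 of the UTF-16LE bytes of the command text.
encode_ps() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | base64 -w 0
}

# Illustrative payload only; the real patch logic lives in the failover script.
PS_SNIPPET='Write-Output "hello from $env:COMPUTERNAME"'
ENCODED=$(encode_ps "$PS_SNIPPET")
echo "$ENCODED"

# Assumed usage (not executed here; <user> is a placeholder):
#   ssh <user>@10.0.20.36 "powershell -NoProfile -EncodedCommand $ENCODED"
```

Because the encoded string contains only base64 characters, it passes through the local shell and the SSH command line without any quoting of the PowerShell text itself, which is the property the commit message relies on.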
98
proxmox/vm109-windows-dr/scripts/pveelite-down-alert.sh
Normal file
@@ -0,0 +1,98 @@
#!/bin/bash
#
# Detects pveelite outage and emails the operator with copy-paste
# failover instructions. Runs on pvemini every minute.
#
# Threshold: 5 consecutive minute failures before alerting (avoids
# false positives from short network blips). State is held in
# /var/run/pveelite-down-counter so a flap drops back to 0.
#
# Schedule (cron on pvemini): * * * * * /opt/scripts/pveelite-down-alert.sh

set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

PVEELITE_IP="10.0.20.202"
PVEMINI_IP="10.0.20.201"
DATASET="rpool/oracle-backups"
COUNTER_FILE="/var/run/pveelite-down-counter"
ALERT_SENT_FILE="/var/run/pveelite-down-alerted"
ALERT_THRESHOLD=5
ALERT_RECIPIENT="${ALERT_RECIPIENT:-root}"

if ping -c 1 -W 2 "$PVEELITE_IP" >/dev/null 2>&1; then
    # Reset counter on success and clear "alerted" flag so a future outage re-fires.
    rm -f "$COUNTER_FILE" "$ALERT_SENT_FILE"
    exit 0
fi

# Failure tick
COUNT=$(( $(cat "$COUNTER_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$COUNT" >"$COUNTER_FILE"

[ "$COUNT" -lt "$ALERT_THRESHOLD" ] && exit 0
[ -f "$ALERT_SENT_FILE" ] && exit 0   # already alerted this outage

# Gather diagnostics for the email body. The "|| true" keeps a failing zfs
# call (pipefail + set -e) from aborting the script before the alert is sent.
LAST_REPL=$(zfs list -t snapshot -o name,creation -s creation 2>/dev/null \
    | awk -v p="$DATASET@repl_" '$1 ~ p {snap=$1; ts=$2 " " $3 " " $4 " " $5 " " $6} END {print snap " (" ts ")"}' || true)
LAST_VM109_REPL=$(zfs list -t snapshot -o name,creation -s creation 2>/dev/null \
    | awk '/vm-109-disk-1@__replicate_109/ {snap=$1; ts=$2 " " $3 " " $4 " " $5 " " $6} END {print snap " (" ts ")"}' || true)

cat <<EOF | mail -s "[CRITICAL] pveelite DOWN — DR failover required" "$ALERT_RECIPIENT"
pveelite ($PVEELITE_IP) has been unreachable for $COUNT consecutive minutes.

═══════════════════════════════════════════════════════════════
IMPACT
═══════════════════════════════════════════════════════════════
✗ VM 109 (Oracle DR test) cannot start while pveelite is down,
  unless you migrate it to pvemini.
✗ Oracle backup NFS export at $PVEELITE_IP:/mnt/pve/oracle-backups
  is unreachable. Primary Oracle (10.0.20.36) SCP transfers will
  fail and accumulate locally on the Windows source.
✗ The next weekly DR test will fail unless storage is failed over.

═══════════════════════════════════════════════════════════════
RECOVERY POINT
═══════════════════════════════════════════════════════════════
Last oracle-backups ZFS replica on pvemini:
  $LAST_REPL
Last VM 109 disk replica on pvemini:
  $LAST_VM109_REPL
Last rsync mirror on pve1: see /mnt/pve/backup-ssd/oracle-backups-mirror

═══════════════════════════════════════════════════════════════
ACTIVATE FAILOVER
═══════════════════════════════════════════════════════════════
On pvemini ($PVEMINI_IP):

    ssh root@$PVEMINI_IP
    /opt/scripts/failover-dr-to-pvemini.sh

The script will:
  1. Confirm pveelite is unreachable (refuses to split-brain).
  2. Flip rpool/oracle-backups on pvemini from readonly to writable.
  3. Configure NFS export of /mnt/pve/oracle-backups on pvemini.
  4. SSH to Oracle production (10.0.20.36) and patch
     D:\\rman_backup\\transfer_backups.ps1 to ship to $PVEMINI_IP.

If you want to start the next DR test from pvemini before failback:

    ha-manager migrate vm:109 pvemini   # if VM 109 is in HA
    qm start 109                        # then start it

═══════════════════════════════════════════════════════════════
WHEN PVEELITE IS BACK
═══════════════════════════════════════════════════════════════
    /opt/scripts/failback-dr-to-pveelite.sh

═══════════════════════════════════════════════════════════════
DETAILS
═══════════════════════════════════════════════════════════════
Detection time:   $(date '+%Y-%m-%d %H:%M:%S')
Failures so far:  $COUNT consecutive minutes
Source:           pvemini cron pveelite-down-alert.sh
Next check:       in 1 minute
EOF

touch "$ALERT_SENT_FILE"
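
The counter-and-latch behaviour of the script above can be exercised without a real outage. The sketch below is a simplified restatement for testing and is not part of the commit: the tick function, its up/down argument, and the temporary state directory are hypothetical stand-ins for the ping result and the /var/run files.

```shell
#!/bin/bash
# Hypothetical harness for the threshold/latch logic (not the shipped script).
STATE_DIR=$(mktemp -d)
COUNTER_FILE="$STATE_DIR/counter"
ALERT_SENT_FILE="$STATE_DIR/alerted"
ALERT_THRESHOLD=5

# tick up|down: prints "alert" only on the first failure at or past the threshold.
tick() {
    if [ "$1" = "up" ]; then
        # Host reachable: reset the counter and re-arm the alert.
        rm -f "$COUNTER_FILE" "$ALERT_SENT_FILE"
        echo "ok"
        return 0
    fi
    local count
    count=$(( $(cat "$COUNTER_FILE" 2>/dev/null || echo 0) + 1 ))
    echo "$count" >"$COUNTER_FILE"
    if [ "$count" -lt "$ALERT_THRESHOLD" ] || [ -f "$ALERT_SENT_FILE" ]; then
        echo "quiet"
        return 0
    fi
    touch "$ALERT_SENT_FILE"
    echo "alert"
}

# Simulate 4 failures, one recovery, then 6 failures.
RESULT=""
for state in down down down down up down down down down down down; do
    RESULT="$RESULT $(tick "$state")"
done
echo "$RESULT"
rm -rf "$STATE_DIR"
```

In this simulation the four failures before the recovery never alert, because the successful check resets the counter; the alert fires once on the fifth consecutive failure afterwards and stays quiet on the sixth.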