ROMFASTSQL/proxmox/vm109-windows-dr/scripts/pveelite-down-alert.sh

#!/bin/bash
#
# Detects pveelite outage and emails the operator with copy-paste
# failover instructions. Runs on pvemini every minute.
#
# Threshold: 5 consecutive failed checks (one per minute) before alerting,
# which avoids false positives from short network blips. The failure count
# is kept in /var/run/pveelite-down-counter; any successful ping resets it.
#
# Schedule (cron on pvemini): * * * * * /opt/scripts/pveelite-down-alert.sh
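#
# Worked example of the threshold logic (illustrative times, assuming
# pveelite stops answering at 12:00 and recovers at 12:07):
#   12:00-12:03  ping fails -> counter 1..4, below threshold, no mail
#   12:04        ping fails -> counter 5, alert mailed, "alerted" flag set
#   12:05-12:06  ping fails -> counter 6..7, flag present, no second mail
#   12:07        ping OK    -> counter and flag removed, ready to re-fire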

set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

PVEELITE_IP="10.0.20.202"
PVEMINI_IP="10.0.20.201"
DATASET="rpool/oracle-backups"
COUNTER_FILE="/var/run/pveelite-down-counter"
ALERT_SENT_FILE="/var/run/pveelite-down-alerted"
ALERT_THRESHOLD=5
ALERT_RECIPIENT="${ALERT_RECIPIENT:-root}"

if ping -c 1 -W 2 "$PVEELITE_IP" >/dev/null 2>&1; then
    # Reset counter on success and clear "alerted" flag so a future outage re-fires.
    rm -f "$COUNTER_FILE" "$ALERT_SENT_FILE"
    exit 0
fi

# Ping failed: increment the consecutive-failure counter.
COUNT=$(( $(cat "$COUNTER_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$COUNT" >"$COUNTER_FILE"

[ "$COUNT" -lt "$ALERT_THRESHOLD" ] && exit 0
[ -f "$ALERT_SENT_FILE" ] && exit 0   # already alerted this outage

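# Gather diagnostics for the email body. "|| true" keeps a zfs failure
# (set -e with pipefail) from aborting the script before the alert is sent.
# -H drops the header row and tab-separates fields, so creation is one field.
LAST_REPL=$(zfs list -H -t snapshot -o name,creation -s creation 2>/dev/null \
    | awk -F'\t' -v p="$DATASET@repl_" 'index($1, p) == 1 {snap=$1; ts=$2} END {if (snap == "") print "none found"; else print snap " (" ts ")"}' || true)
LAST_VM109_REPL=$(zfs list -H -t snapshot -o name,creation -s creation 2>/dev/null \
    | awk -F'\t' '$1 ~ /vm-109-disk-1@__replicate_109/ {snap=$1; ts=$2} END {if (snap == "") print "none found"; else print snap " (" ts ")"}' || true)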

cat <<EOF | mail -s "[CRITICAL] pveelite DOWN — DR failover required" "$ALERT_RECIPIENT"
pveelite ($PVEELITE_IP) has been unreachable for $COUNT consecutive minutes.

═══════════════════════════════════════════════════════════════
IMPACT
═══════════════════════════════════════════════════════════════
  ✗ VM 109 (Oracle DR test) cannot start while pveelite is down,
    unless you migrate it to pvemini.
  ✗ Oracle backup NFS export at $PVEELITE_IP:/mnt/pve/oracle-backups
    is unreachable. SCP transfers from primary Oracle (10.0.20.36) will
    fail, and backups will accumulate locally on the Windows source.
  ✗ The next weekly DR test will fail unless storage is failed over.

═══════════════════════════════════════════════════════════════
RECOVERY POINT
═══════════════════════════════════════════════════════════════
  Last oracle-backups ZFS replica on pvemini:
    $LAST_REPL
  Last VM 109 disk replica on pvemini:
    $LAST_VM109_REPL
  Last rsync mirror on pve1: see /mnt/pve/backup-ssd/oracle-backups-mirror

═══════════════════════════════════════════════════════════════
ACTIVATE FAILOVER
═══════════════════════════════════════════════════════════════
On pvemini ($PVEMINI_IP):

    ssh root@$PVEMINI_IP
    /opt/scripts/failover-dr-to-pvemini.sh

The script will:
  1. Confirm pveelite is unreachable (refuses to split-brain).
  2. Flip rpool/oracle-backups on pvemini from readonly to writable.
  3. Configure NFS export of /mnt/pve/oracle-backups on pvemini.
  4. SSH to Oracle production (10.0.20.36) and patch
     D:\\rman_backup\\transfer_backups.ps1 to ship to $PVEMINI_IP.

If you want to start the next DR test from pvemini before failback:

    ha-manager migrate vm:109 pvemini   # if VM 109 is in HA
    qm start 109                         # then start it

═══════════════════════════════════════════════════════════════
WHEN PVEELITE IS BACK
═══════════════════════════════════════════════════════════════
    /opt/scripts/failback-dr-to-pveelite.sh

═══════════════════════════════════════════════════════════════
DETAILS
═══════════════════════════════════════════════════════════════
  Detection time:    $(date '+%Y-%m-%d %H:%M:%S')
  Failures so far:   $COUNT consecutive minutes
  Source:            pvemini cron pveelite-down-alert.sh
  Next check:        in 1 minute (no repeat mail for this outage)
EOF

touch "$ALERT_SENT_FILE"
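
# Manual test sketch (an assumption about how an operator might exercise
# this script, not part of the original runbook). During a real or
# simulated outage, pre-seeding the counter makes the next run cross the
# threshold at once instead of after five cron ticks:
#
#   rm -f /var/run/pveelite-down-alerted
#   echo 4 > /var/run/pveelite-down-counter
#   /opt/scripts/pveelite-down-alert.sh
#
# The ping check runs first, so nothing is mailed while pveelite still
# answers; a successful ping also deletes both state files.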