Compare commits


5 Commits

Author SHA1 Message Date
Claude Agent
1d67f0705b docs(dr): refresh README with post-2026-04-25 architecture
VM 109 returned to its original home on pveelite, co-located with
oracle-backups NFS storage. The README is updated to reflect that:
the VM is now in HA (ha-prefer-pveelite, state=stopped, nofailback=1)
rather than excluded from HA, and the new layered defences (trap
guard, watchdog cron, dynamic memory pre-flight, max_restart caps)
are documented alongside the original 8a0c557 trap.

Adds a Storage Failover section describing the pveelite -> pvemini
manual failover flow: email alert from pveelite-down-alert.sh,
failover-dr-to-pvemini.sh on the surviving node, failback when
pveelite returns. The pve1 nightly mirror is the third copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 19:19:38 +00:00
Claude Agent
b7fffe1467 feat(dr): manual failover/failback scripts + pveelite-down email alert
failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote
the rpool/oracle-backups dataset between nodes when pveelite is down.
Both refuse to run if the other side is reachable to prevent split-brain.
Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via
SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201.

The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash
caller does not need to escape PowerShell quoting. End-to-end test
including failover -> failback confirmed transfer_backups.ps1 returns
to byte-identical state (SHA256 43DD2187...).
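
A minimal sketch of the shared encoding step (the same pipeline both
scripts use; see the failover/failback sources below):

```bash
# -EncodedCommand expects UTF-16LE, base64-encoded, so no quoting has to
# survive the bash -> ssh -> cmd.exe -> powershell chain.
PS_B64=$(printf '%s' "$PS_SCRIPT" | iconv -t UTF-16LE | base64 -w0)
ssh -p 22122 dr-failover@10.0.20.36 "powershell -NoProfile -EncodedCommand $PS_B64"
```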

pveelite-down-alert.sh runs every minute on pvemini and emails an alert
with copy-paste failover instructions after 5 consecutive ping failures.
The alert body includes the latest oracle-backups and VM 109 replica
timestamps so the operator knows the recovery point before deciding.

The DR weekly-test script gains a cluster-aware guard at the top that
exits silently when /etc/pve/qemu-server/109.conf is not on the local
node, allowing the same cron entry to be present on both pveelite and
pvemini without double-firing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 19:18:51 +00:00
Claude Agent
a62bcb4331 feat(dr): replicate oracle-backups dataset, mirror to pve1 nightly
Convert /mnt/pve/oracle-backups from a directory on the pveelite
rootfs into a dedicated ZFS dataset rpool/oracle-backups so it can be
incrementally replicated to pvemini. zfs-replicate-oracle-backups.sh
runs every 15 minutes from cron on pveelite and uses zfs send/recv
over the cluster's internal SSH (direct IP, /etc/pve/priv/known_hosts)
to avoid Tailscale magicDNS detours that broke the first attempt.
The destination dataset is set readonly=on so accidental writes on
pvemini cannot diverge it. Snapshot pruning keeps 5 rolling copies.
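
A sketch of one steady-state replication tick (names as in the script
below):

```bash
# Snapshot, then incremental send against the newest earlier repl_* snapshot.
NEW_SNAP="rpool/oracle-backups@repl_$(date +%Y%m%d_%H%M%S)"
zfs snapshot "$NEW_SNAP"
zfs send -i "$PREV_SNAP" "$NEW_SNAP" \
  | ssh root@10.0.20.201 "zfs recv -F rpool/oracle-backups"
```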

nightly-backup-mirror.sh ships a third copy nightly to pve1's
backup-ssd (ext4 SATA) — different physical disk, different
filesystem, different node — guarding against the failure mode where
both pveelite and pvemini are simultaneously unavailable. The same
script tars /etc/pve and rotates 14 days of cluster config archives,
since pmxcfs is in-RAM and a multi-node quorum loss would otherwise
take cluster config with it.
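
The /etc/pve capture is a plain tar streamed over SSH; a sketch of the
core pipeline (paths as in the script below):

```bash
# pmxcfs presents the cluster config at /etc/pve; archive the live mount
# and write it straight onto pve1's backup-ssd.
tar czf - -C / etc/pve | ssh root@10.0.20.200 \
  "cat > /mnt/pve/backup-ssd/pve-config-backups/pve-config-$(date +%Y%m%d_%H%M%S).tar.gz"
```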

The old directory is kept as oracle-backups.old-DELETE-AFTER-2026-05-02
on pveelite for one week as a safety net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 19:00:04 +00:00
Claude Agent
62e9926bd4 feat(dr): add cluster + memory pre-flight, deploy VM 109 watchdog
DR test script now refuses to start VM 109 if:
  * cluster is not quorate (e.g. mid-failover into a degraded state),
  * available memory on the host is below VM 109 config + 1 GB margin.

Both checks read live state rather than fixed limits: the memory
threshold is computed from qm config, so resizing VM 109 does not
require touching the script.
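
The derivation of the memory check, in sketch form (as in the
weekly-dr-test diff below):

```bash
# Threshold scales with VM 109's configured RAM plus a 1 GB margin.
dr_vm_mem_mb=$(qm config 109 | awk '/^memory:/ {print $2}')
avail_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
min_free_mb=$((dr_vm_mem_mb + 1024))
[ "$avail_mb" -ge "$min_free_mb" ] || echo "refuse start: ${avail_mb}MB < ${min_free_mb}MB"
```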

Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The
watchdog is the second line of defence behind the cleanup trap from
8a0c557: it force-stops VM 109 if the trap was bypassed (script
killed, host crash mid-test, manual run forgotten). It honours
/var/run/vm109-debug.flag for legitimate manual sessions and is
node-aware via /etc/pve/qemu-server/109.conf so it can be deployed
on every node without coordinating with VM 109's current location.
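
Both properties come down to two guards at the top of the watchdog
(sketch; full source below):

```bash
[ -f /etc/pve/qemu-server/109.conf ] || exit 0   # VM 109 is hosted elsewhere
[ -f /var/run/vm109-debug.flag ] && exit 0       # legitimate manual session
```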

Both safeguards target the 04-18 → 04-20 failure chain: VM 109 was left
running for 2.5 days, and the subsequent HA failover pushed CT 108
Oracle (8 GB) onto pveelite (16 GB) → OOM cascade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 18:48:12 +00:00
Claude Agent
2e8cd9ca59 fix(dr-test): guard cleanup trap + surface qm start errors
The cleanup trap added in 8a0c557 stopped VM 109 unconditionally on EXIT,
which kills the VM during --install/--help or when an operator launched
it manually for debugging. Gate the trap with DR_VM_STARTED_BY_US so it
only fires when the script itself started the VM.
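
The gating pattern in sketch form (the diff below shows the actual
change):

```bash
DR_VM_STARTED_BY_US=false
cleanup_vm() {
  [ "$DR_VM_STARTED_BY_US" = "true" ] || return 0  # not ours to stop
  qm stop 109 --skiplock 2>/dev/null || true
}
trap cleanup_vm EXIT
# ...later, flipped only after a successful start:
qm start 109 && DR_VM_STARTED_BY_US=true
```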

Also remove the 2>/dev/null swallow on qm start so cross-node failures
(e.g. running on a node where the VM is not configured) appear in the
log instead of producing a silent "Failed to start VM 109" in 0 seconds.

Root cause for the 2026-04-25 silent failure: cron lived on pveelite
while VM 109 had been migrated to pvemini; qm start returned an error
that was hidden by the redirect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 08:47:54 +00:00
8 changed files with 681 additions and 21 deletions

View File: README.md

@@ -4,26 +4,33 @@
**VMID:** 109
**Role:** Disaster Recovery for Oracle Database (RMAN backups from an external Windows server)
## ⚠️ Important — VM 109 is NOT in HA (since 2026-04-20)
## ⚠️ Important — Topology after 2026-04-25
After the 2026-04-20 incident (see `../cluster/incidents/2026-04-20-cluster-outage.md`), VM 109 was removed from HA with `ha-manager remove vm:109`. The reasons:
VM 109 lives on **pveelite** (10.0.20.202), co-located with the oracle-backups NFS storage. The post-04-20-incident configuration:
- VM 109 is a DR test VM, not a live service
- The Saturday DR test script (`scripts/weekly-dr-test-proxmox.sh`) starts/stops VM 109 manually with `qm start/stop`
- With HA active, a `set -e` bug in the script left VM 109 running for 2.5 days; then, when pvemini crashed, HA relocated VM 109 onto pveelite (16 GB) → OOM cascade
**Effects:**
- VM 109 is NO longer restarted automatically on node crash
- VM 109 no longer moves off pvemini
- VM 109 starts ONLY when the DR script invokes it, or manually with `qm start 109`
- The DR script now has `trap cleanup_vm EXIT`, which guarantees `qm stop 109` on any exit
- **VM 109 in HA, group `ha-prefer-pveelite`** (pveelite=100, pvemini=50, pve1=10), `state=stopped`, `nofailback=1` — HA fails over if pveelite dies but does not restart VM 109 automatically (it stays stopped; the DR script starts it weekly). A sketch of the equivalent commands follows this list.
- **Defences against a repeat of incident 04-20**:
  - `trap cleanup_vm EXIT` in the DR script (commit 8a0c557) with the `DR_VM_STARTED_BY_US` guard (commit 2e8cd9c) — stops VM 109 only if the script itself started it.
  - `vm109-watchdog.sh` cron on both pveelite and pvemini (cluster-aware) — force-stops VM 109 if it runs > 60 min outside the test window (Saturday 05:55-07:30). Debug exemption: `touch /var/run/vm109-debug.flag`.
  - Pre-flight check in the DR script: refuses `qm start 109` if the cluster is degraded or available memory < (VM 109 mem + 1 GB margin).
  - `max_restart=3, max_relocate=2` on all HA services — caps restart loops under OOM.
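A sketch of the `ha-manager` commands that would produce this configuration (flag names assumed from the pve-ha-manager CLI; verify with `man ha-manager` before running):
```bash
ha-manager groupadd ha-prefer-pveelite \
  --nodes "pveelite:100,pvemini:50,pve1:10" --nofailback 1
ha-manager add vm:109 --group ha-prefer-pveelite --state stopped \
  --max_restart 3 --max_relocate 2
```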
**Status check:**
```bash
ssh root@10.0.20.201 "qm status 109" # must be stopped
ssh root@10.0.20.201 "ha-manager status | grep 109 || echo 'not in HA'"
ssh root@10.0.20.201 "ha-manager status | grep -E '109|201|108'"
ssh root@10.0.20.202 "qm status 109" # must be stopped between tests
```
## 🔄 Storage Failover (pveelite → pvemini)
`/mnt/pve/oracle-backups` is a ZFS dataset replicated pveelite → pvemini every 15 min (`zfs-replicate-oracle-backups.sh`), plus a nightly mirror to pve1's backup-ssd (`nightly-backup-mirror.sh`). When pveelite is down:
1. **Automatic email** from `pveelite-down-alert.sh` (cron on pvemini, 5-minute threshold) with copy-paste failover instructions.
2. The operator runs on pvemini: `/opt/scripts/failover-dr-to-pvemini.sh` — promotes the ZFS replica (readonly → off), configures the NFS export, patches the primary Oracle scheduled-task IP via SSH.
3. When pveelite returns: `/opt/scripts/failback-dr-to-pveelite.sh` — the inverse, with an incremental zfs send plus config restore.
Both scripts refuse to run if the other side is reachable (anti-split-brain).
---
# 🛡️ Oracle DR System - Complete Architecture

View File: failback-dr-to-pveelite.sh

@@ -0,0 +1,142 @@
#!/bin/bash
#
# Failback Oracle DR storage from pvemini back to pveelite.
#
# When to run: pveelite has been brought back online and you want to
# return to the normal topology (pveelite = active, pvemini = readonly
# replica). Inverse of failover-dr-to-pvemini.sh.
#
# Sequence:
# 1. Confirm pveelite reachable.
# 2. Snapshot current writable state on pvemini.
# 3. Send the snapshot to pveelite (overwrites stale state there).
# 4. Stop NFS on pvemini, remove its export entry.
# 5. Set pvemini readonly=on (back to replica role).
# 6. On pveelite: zfs recv finalisation, set readonly=off, restart NFS.
# 7. Patch transfer_backups.ps1 on Oracle Windows back to pveelite IP.
# 8. Re-arm replication cron (which already lives on pveelite).
#
# This script orchestrates from pvemini so it can SSH outward to pveelite.
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
PVEELITE_IP="10.0.20.202"
PVEMINI_IP="10.0.20.201"
DATASET="rpool/oracle-backups"
MOUNTPOINT="/mnt/pve/oracle-backups"
NFS_CLIENT="10.0.20.37"
NFS_OPTS="rw,sync,no_subtree_check,no_root_squash"
PRIMARY_HOST="10.0.20.36"
PRIMARY_USER="dr-failover"
PRIMARY_SSH_PORT="22122"
TRANSFER_SCRIPT_WIN_PATH='D:\rman_backup\transfer_backups.ps1'
LOG="/var/log/oracle-dr/failover.log"
SSH_OPTS_PVE="-o UserKnownHostsFile=/etc/pve/priv/known_hosts -o StrictHostKeyChecking=no -o BatchMode=yes"
mkdir -p "$(dirname "$LOG")"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
if [ "$(hostname)" != "pvemini" ]; then
log "FATAL: this script must run on pvemini (current: $(hostname))"
exit 1
fi
log "============================================================"
log "Oracle DR failback: pvemini -> pveelite"
log "============================================================"
# Step 1: verify pveelite reachable
log "Step 1: verifying pveelite reachable"
if ! ping -c 3 -W 2 "$PVEELITE_IP" >/dev/null 2>&1; then
log "ABORT: pveelite is still unreachable."
exit 2
fi
if ! ssh $SSH_OPTS_PVE "root@$PVEELITE_IP" "true" 2>/dev/null; then
log "ABORT: pveelite SSH not responding."
exit 2
fi
log " pveelite reachable."
# Step 2: take a final snapshot on pvemini before handing back
SNAP="${DATASET}@failback_$(date +%Y%m%d_%H%M%S)"
log "Step 2: snapshot $SNAP"
zfs snapshot "$SNAP"
# Step 3: send to pveelite
log "Step 3: sending snapshot to pveelite (incremental from latest common)"
COMMON_BASE=$(comm -12 \
<(zfs list -H -t snapshot -o name "$DATASET" | sed "s|^$DATASET@||" | sort) \
<(ssh $SSH_OPTS_PVE "root@$PVEELITE_IP" "zfs list -H -t snapshot -o name $DATASET 2>/dev/null | sed 's|^$DATASET@||' | sort") \
| tail -1)
if [ -z "$COMMON_BASE" ]; then
log " no common snapshot — refusing to do full send (would destroy pveelite state)."
log " Manual recovery required. Inspect: zfs list -t snapshot $DATASET on both nodes."
exit 3
fi
log " common base: $DATASET@$COMMON_BASE"
log " sending ${DATASET}@${COMMON_BASE} -> $SNAP to pveelite"
zfs send -i "${DATASET}@${COMMON_BASE}" "$SNAP" \
| ssh $SSH_OPTS_PVE "root@$PVEELITE_IP" "zfs recv -F $DATASET" 2>&1 | tee -a "$LOG"
# Step 4: stop NFS on pvemini, remove export
log "Step 4: stopping NFS on pvemini"
EXPORT_LINE="$MOUNTPOINT $NFS_CLIENT($NFS_OPTS)"
if grep -qF "$EXPORT_LINE" /etc/exports; then
sed -i "\#$EXPORT_LINE#d" /etc/exports
log " export removed from /etc/exports"
fi
exportfs -ra
# Only stop NFS server if no other exports remain
if [ -z "$(exportfs -v 2>/dev/null)" ]; then
systemctl stop nfs-server
log " nfs-server stopped (no other exports)"
fi
# Step 5: pvemini back to readonly replica
log "Step 5: setting pvemini dataset readonly=on"
zfs set readonly=on "$DATASET"
# Step 6: pveelite take over as primary
log "Step 6: activating pveelite as primary"
ssh $SSH_OPTS_PVE "root@$PVEELITE_IP" "
set -e
zfs set readonly=off $DATASET
systemctl is-enabled --quiet nfs-server || systemctl enable nfs-server
systemctl is-active --quiet nfs-server || systemctl start nfs-server
exportfs -ra
exportfs -v
" 2>&1 | tee -a "$LOG"
# Step 7: patch primary Oracle script back (literal Replace via PS EncodedCommand)
log "Step 7: patching $TRANSFER_SCRIPT_WIN_PATH back to $PVEELITE_IP"
PS_SCRIPT="\$path = '$TRANSFER_SCRIPT_WIN_PATH'
\$old = '\"$PVEMINI_IP\"'
\$new = '\"$PVEELITE_IP\"'
\$content = Get-Content \$path -Raw
if (\$content.Contains(\$old)) {
Set-Content -Path \$path -Value \$content.Replace(\$old, \$new) -NoNewline
Write-Output 'PATCHED_BACK'
} elseif (\$content.Contains(\$new)) {
Write-Output 'ALREADY_AT_PVEELITE'
} else {
Write-Output 'UNKNOWN_DRHost_VALUE'
}"
PS_B64=$(printf '%s' "$PS_SCRIPT" | iconv -t UTF-16LE | base64 -w0)
PATCH_RESULT=$(ssh -p "$PRIMARY_SSH_PORT" -o ConnectTimeout=10 -o BatchMode=yes \
"$PRIMARY_USER@$PRIMARY_HOST" \
"powershell -NoProfile -EncodedCommand $PS_B64" 2>&1 \
| grep -vE '^#< CLIXML|<Objs |</Objs>$' | tr -d '\r' | head -1)
if [ -n "$PATCH_RESULT" ]; then
log " result: $PATCH_RESULT"
else
log " WARNING: SSH to primary failed — edit \$DRHost = \"$PVEELITE_IP\" manually"
fi
# Step 8: replication cron on pveelite is unchanged, will resume on schedule
log "Step 8: replication cron on pveelite resumes automatically (*/15)"
log "============================================================"
log "Failback complete. pveelite is again the active NFS source."
log "============================================================"

View File: failover-dr-to-pvemini.sh

@@ -0,0 +1,132 @@
#!/bin/bash
#
# Failover the Oracle DR storage from pveelite to pvemini.
#
# When to run: pveelite is dead long enough that the user has chosen to
# take over backup ingestion on pvemini rather than wait. The
# pveelite-down email alert points the operator at this script.
#
# What it does:
# 1. Confirms pveelite is actually unreachable (refuses to split-brain).
# 2. Flips rpool/oracle-backups on pvemini from readonly replica to
# writable primary.
# 3. Configures and starts the NFS export on pvemini so VM 109 can
# still mount /mnt/pve/oracle-backups when it boots there.
# 4. Patches transfer_backups.ps1 on the Oracle Windows production
# host (10.0.20.36) to ship to pvemini's IP instead of pveelite's.
# 5. Disables the original ZFS replication cron (which would now fail
# since the source pveelite is down).
# 6. Prints next steps for the operator.
#
# Idempotent: rerunning is safe — each step checks before acting.
#
# Reverse: /opt/scripts/failback-dr-to-pveelite.sh once pveelite is back.
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
PVEELITE_IP="10.0.20.202"
PVEMINI_IP="10.0.20.201"
DATASET="rpool/oracle-backups"
MOUNTPOINT="/mnt/pve/oracle-backups"
NFS_CLIENT="10.0.20.37" # VM 109 NFS client
NFS_OPTS="rw,sync,no_subtree_check,no_root_squash"
PRIMARY_HOST="10.0.20.36"
PRIMARY_USER="dr-failover"
PRIMARY_SSH_PORT="22122"
TRANSFER_SCRIPT_WIN_PATH='D:\rman_backup\transfer_backups.ps1'
LOG="/var/log/oracle-dr/failover.log"
SSH_OPTS_PVE="-o UserKnownHostsFile=/etc/pve/priv/known_hosts -o StrictHostKeyChecking=no -o BatchMode=yes"
mkdir -p "$(dirname "$LOG")"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }
if [ "$(hostname)" != "pvemini" ]; then
log "FATAL: this script must run on pvemini (current: $(hostname))"
exit 1
fi
log "============================================================"
log "Oracle DR failover: pveelite -> pvemini"
log "============================================================"
# Step 1: confirm pveelite is unreachable
log "Step 1: verifying pveelite ($PVEELITE_IP) is unreachable..."
if ping -c 3 -W 2 "$PVEELITE_IP" >/dev/null 2>&1; then
log "ABORT: pveelite responds to ping. Refusing to split-brain."
log " If you really want to force failover anyway:"
log " 1. Confirm pveelite NFS service is dead (systemctl status nfs-server)"
log " 2. Stop pveelite NFS first: ssh pveelite 'systemctl stop nfs-server'"
log " 3. Then re-run this script."
exit 2
fi
log " pveelite unreachable, proceeding."
# Step 2: flip dataset to writable
CURRENT_RO=$(zfs get -H -o value readonly "$DATASET")
log "Step 2: dataset readonly status = $CURRENT_RO"
if [ "$CURRENT_RO" = "on" ]; then
log " setting readonly=off on $DATASET"
zfs set readonly=off "$DATASET"
else
log " already writable, no change"
fi
# Step 3: NFS export
log "Step 3: configuring NFS export on pvemini"
EXPORT_LINE="$MOUNTPOINT $NFS_CLIENT($NFS_OPTS)"
if grep -qF "$EXPORT_LINE" /etc/exports; then
log " export already present in /etc/exports"
else
log " appending export line"
echo "$EXPORT_LINE" >> /etc/exports
fi
systemctl is-enabled --quiet nfs-server || systemctl enable nfs-server
systemctl is-active --quiet nfs-server || systemctl start nfs-server
exportfs -ra
log " active exports:"
exportfs -v 2>&1 | sed 's/^/ /' | tee -a "$LOG"
# Step 4: patch primary Oracle transfer script.
# Use literal String.Replace (no regex). Send via PowerShell -EncodedCommand
# (UTF-16LE base64) to bypass all bash <-> SSH <-> PowerShell quoting issues.
log "Step 4: patching $TRANSFER_SCRIPT_WIN_PATH on $PRIMARY_HOST"
PS_SCRIPT="\$path = '$TRANSFER_SCRIPT_WIN_PATH'
\$old = '\"$PVEELITE_IP\"'
\$new = '\"$PVEMINI_IP\"'
\$content = Get-Content \$path -Raw
if (\$content.Contains(\$old)) {
Set-Content -Path \$path -Value \$content.Replace(\$old, \$new) -NoNewline
Write-Output 'PATCHED'
} elseif (\$content.Contains(\$new)) {
Write-Output 'ALREADY_FAILED_OVER'
} else {
Write-Output 'UNKNOWN_DRHost_VALUE'
}"
PS_B64=$(printf '%s' "$PS_SCRIPT" | iconv -t UTF-16LE | base64 -w0)
PATCH_RESULT=$(ssh -p "$PRIMARY_SSH_PORT" -o ConnectTimeout=10 -o BatchMode=yes \
"$PRIMARY_USER@$PRIMARY_HOST" \
"powershell -NoProfile -EncodedCommand $PS_B64" 2>&1 \
| grep -vE '^#< CLIXML|<Objs |</Objs>$' | tr -d '\r' | head -1)
if [ -n "$PATCH_RESULT" ]; then
log " result: $PATCH_RESULT"
else
log " WARNING: SSH to primary failed — operator must edit $TRANSFER_SCRIPT_WIN_PATH manually"
log " Set: \$DRHost = \"$PVEMINI_IP\""
fi
# Step 5: disable original replication cron entry locally too
# (it lives on pveelite; nothing to do here, but document)
log "Step 5: ZFS replication cron is on pveelite which is down — no action needed"
# Step 6: print next steps
log "============================================================"
log "Failover complete on pvemini."
log "Next steps for the operator:"
log " 1. Verify VM 109 starts here if a DR test is needed:"
log " qm start 109 (once HA migrates VM 109 to pvemini, or manually)"
log " 2. Watch the next scheduled Oracle backup land on pvemini:"
log " tail -f /var/log/syslog | grep nfsd"
log " 3. When pveelite returns, run /opt/scripts/failback-dr-to-pveelite.sh"
log "============================================================"

View File: nightly-backup-mirror.sh

@@ -0,0 +1,63 @@
#!/bin/bash
#
# Nightly mirror of Oracle backups + cluster config to pve1's backup-ssd.
#
# Why two redundant copies are not enough:
# * ZFS replica pveelite -> pvemini covers pveelite hardware failure.
# * If both pveelite AND pvemini are down (rare but possible — common
# storage controller, network rack, electrical fault), pve1 is the
# last copy. Keeping it on a different physical disk type (SATA
# ext4) further insulates against ZFS-on-NVMe-specific failures.
# * /etc/pve is in pmxcfs (in-RAM, replicated cluster-wide). If
# quorum is lost on multiple nodes simultaneously the config is
# unrecoverable without a backup.
#
# Schedule (cron on pveelite): 0 4 * * *
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
PVE1_HOST="10.0.20.200"
PVE1_BACKUP_DIR="/mnt/pve/backup-ssd"
ORACLE_SRC="/mnt/pve/oracle-backups/"
ORACLE_DST="${PVE1_BACKUP_DIR}/oracle-backups-mirror/"
PVE_CFG_DST="${PVE1_BACKUP_DIR}/pve-config-backups"
LOG="/var/log/oracle-dr/nightly-mirror.log"
SSH_OPTS="-o UserKnownHostsFile=/etc/pve/priv/known_hosts -o StrictHostKeyChecking=no -o BatchMode=yes"
KEEP_PVE_CONFIGS=14 # 2 weeks of nightly /etc/pve archives
mkdir -p "$(dirname "$LOG")"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >>"$LOG"; }
log "=== Starting nightly mirror ==="
# 1. Rsync Oracle backups to pve1
log "Rsync ${ORACLE_SRC} -> ${PVE1_HOST}:${ORACLE_DST}"
if rsync -aHX --delete -e "ssh ${SSH_OPTS}" \
"${ORACLE_SRC}" "root@${PVE1_HOST}:${ORACLE_DST}" 2>>"$LOG"; then
log "Oracle backups rsync OK"
else
log "ERROR: Oracle backups rsync failed"
fi
# 2. Tar /etc/pve and ship to pve1
TS=$(date +%Y%m%d_%H%M%S)
ARCHIVE="pve-config-${TS}.tar.gz"
log "Tar /etc/pve -> ${PVE1_HOST}:${PVE_CFG_DST}/${ARCHIVE}"
if tar czf - -C / etc/pve 2>/dev/null | \
ssh ${SSH_OPTS} "root@${PVE1_HOST}" \
"cat > '${PVE_CFG_DST}/${ARCHIVE}'" 2>>"$LOG"; then
log "pve-config tar OK ($(ssh ${SSH_OPTS} root@${PVE1_HOST} \
"stat -c %s '${PVE_CFG_DST}/${ARCHIVE}'") bytes)"
else
log "ERROR: pve-config tar failed"
fi
# 3. Prune old pve-config archives on pve1 (keep last KEEP_PVE_CONFIGS)
ssh ${SSH_OPTS} "root@${PVE1_HOST}" "
cd '${PVE_CFG_DST}' && \
ls -1t pve-config-*.tar.gz 2>/dev/null | tail -n +$((KEEP_PVE_CONFIGS + 1)) | xargs -r rm -v
" >>"$LOG" 2>&1 || true
log "=== Nightly mirror completed ==="

View File: pveelite-down-alert.sh

@@ -0,0 +1,98 @@
#!/bin/bash
#
# Detects pveelite outage and emails the operator with copy-paste
# failover instructions. Runs on pvemini every minute.
#
# Threshold: 5 consecutive minute failures before alerting (avoids
# false positives from short network blips). State is held in
# /var/run/pveelite-down-counter so a flap drops back to 0.
#
# Schedule (cron on pvemini): * * * * * /opt/scripts/pveelite-down-alert.sh
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
PVEELITE_IP="10.0.20.202"
PVEMINI_IP="10.0.20.201"
DATASET="rpool/oracle-backups"
COUNTER_FILE="/var/run/pveelite-down-counter"
ALERT_SENT_FILE="/var/run/pveelite-down-alerted"
ALERT_THRESHOLD=5
ALERT_RECIPIENT="${ALERT_RECIPIENT:-root}"
if ping -c 1 -W 2 "$PVEELITE_IP" >/dev/null 2>&1; then
# Reset counter on success and clear "alerted" flag so a future outage re-fires.
rm -f "$COUNTER_FILE" "$ALERT_SENT_FILE"
exit 0
fi
# Failure tick
COUNT=$(( $(cat "$COUNTER_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$COUNT" >"$COUNTER_FILE"
[ "$COUNT" -lt "$ALERT_THRESHOLD" ] && exit 0
[ -f "$ALERT_SENT_FILE" ] && exit 0 # already alerted this outage
# Gather diagnostics for the email body
LAST_REPL=$(zfs list -t snapshot -o name,creation -s creation 2>/dev/null \
| awk -v p="$DATASET@repl_" '$1 ~ p {snap=$1; ts=$2 " " $3 " " $4 " " $5 " " $6} END {print snap " (" ts ")"}')
LAST_VM109_REPL=$(zfs list -t snapshot -o name,creation -s creation 2>/dev/null \
| awk '/vm-109-disk-1@__replicate_109/ {snap=$1; ts=$2 " " $3 " " $4 " " $5 " " $6} END {print snap " (" ts ")"}')
cat <<EOF | mail -s "[CRITICAL] pveelite DOWN — DR failover required" "$ALERT_RECIPIENT"
pveelite ($PVEELITE_IP) has been unreachable for $COUNT consecutive minutes.
═══════════════════════════════════════════════════════════════
IMPACT
═══════════════════════════════════════════════════════════════
✗ VM 109 (Oracle DR test) cannot start while pveelite is down,
unless you migrate it to pvemini.
✗ Oracle backup NFS export at $PVEELITE_IP:/mnt/pve/oracle-backups
is unreachable. Primary Oracle (10.0.20.36) SCP transfers will
fail and accumulate locally on the Windows source.
✗ The next weekly DR test will fail unless storage is failed over.
═══════════════════════════════════════════════════════════════
RECOVERY POINT
═══════════════════════════════════════════════════════════════
Last oracle-backups ZFS replica on pvemini:
$LAST_REPL
Last VM 109 disk replica on pvemini:
$LAST_VM109_REPL
Last rsync mirror on pve1: see /mnt/pve/backup-ssd/oracle-backups-mirror
═══════════════════════════════════════════════════════════════
ACTIVATE FAILOVER
═══════════════════════════════════════════════════════════════
On pvemini ($PVEMINI_IP):
ssh root@$PVEMINI_IP
/opt/scripts/failover-dr-to-pvemini.sh
The script will:
1. Confirm pveelite is unreachable (refuses to split-brain).
2. Flip rpool/oracle-backups on pvemini from readonly to writable.
3. Configure NFS export of /mnt/pve/oracle-backups on pvemini.
4. SSH to Oracle production (10.0.20.36) and patch
D:\\rman_backup\\transfer_backups.ps1 to ship to $PVEMINI_IP.
If you want to start the next DR test from pvemini before failback:
ha-manager migrate vm:109 pvemini # if VM 109 is in HA
qm start 109 # then start it
═══════════════════════════════════════════════════════════════
WHEN PVEELITE IS BACK
═══════════════════════════════════════════════════════════════
/opt/scripts/failback-dr-to-pveelite.sh
═══════════════════════════════════════════════════════════════
DETAILS
═══════════════════════════════════════════════════════════════
Detection time: $(date '+%Y-%m-%d %H:%M:%S')
Failures so far: $COUNT consecutive minutes
Source: pvemini cron pveelite-down-alert.sh
Next check: in 1 minute
EOF
touch "$ALERT_SENT_FILE"

View File: vm109-watchdog.sh

@@ -0,0 +1,99 @@
#!/bin/bash
#
# VM 109 watchdog: stops VM 109 if running outside the DR test window.
#
# Why: incident 2026-04-18 — DR script crashed after starting VM 109 but
# before stopping it. Trap was added (commit 8a0c557) but only fires on
# script exit, not on system crash, kernel panic, or oomkill of the test
# script itself. This watchdog is the second line of defense.
#
# Behavior:
# * If VM 109 is not running: exit silently.
# * If VM 109 is running and uptime <= 60 min: exit silently (test running).
# * If VM 109 is running, uptime > 60 min, debug flag absent, and we are
# OUTSIDE Saturday 05:55-07:30 EEST: alert + stop VM 109.
#
# Debug exemption:
# touch /var/run/vm109-debug.flag # before manual debug
# rm /var/run/vm109-debug.flag # after debug
#
# Schedule (cron on the node hosting VM 109):
# * * * * * /opt/scripts/vm109-watchdog.sh
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
DR_VM_ID="109"
DEBUG_FLAG="/var/run/vm109-debug.flag"
LOG="/var/log/oracle-dr/watchdog.log"
MAX_RUNTIME_S=3600 # 60 minutes outside test window
TEST_WINDOW_START_MIN=$((5 * 60 + 55)) # Saturday 05:55
TEST_WINDOW_END_MIN=$((7 * 60 + 30)) # Saturday 07:30
mkdir -p "$(dirname "$LOG")"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >>"$LOG"; }
# Skip silently if VM 109 config not on this node (cluster-aware).
[ -f /etc/pve/qemu-server/${DR_VM_ID}.conf ] || exit 0
# Skip if not running
qm status "$DR_VM_ID" 2>/dev/null | grep -q running || exit 0
# Skip if debug flag set
[ -f "$DEBUG_FLAG" ] && exit 0
# Get VM 109 uptime in seconds (process etime)
PID_FILE="/var/run/qemu-server/${DR_VM_ID}.pid"
[ -f "$PID_FILE" ] || exit 0
VM_PID=$(cat "$PID_FILE")
UPTIME_S=$(ps -p "$VM_PID" -o etimes= 2>/dev/null | tr -d ' ' || echo 0)
# Within first hour: assume normal test run, no action
[ "$UPTIME_S" -le "$MAX_RUNTIME_S" ] && exit 0
# Inside Saturday test window: assume manual extended test, alert but do not stop
DOW=$(date +%u) # 1=Mon ... 7=Sun, Saturday=6
NOW_MIN=$(( $(date +%H) * 60 + $(date +%M) ))
if [ "$DOW" -eq 6 ] \
&& [ "$NOW_MIN" -ge "$TEST_WINDOW_START_MIN" ] \
&& [ "$NOW_MIN" -le "$TEST_WINDOW_END_MIN" ]; then
log "VM ${DR_VM_ID} running ${UPTIME_S}s in test window — no action (alert sent)"
echo "VM ${DR_VM_ID} running ${UPTIME_S}s during DR test window. Investigate." \
| mail -s "[WARN] VM 109 long-running in test window" root 2>/dev/null || true
exit 0
fi
# Outside test window + uptime exceeded: alert and stop
log "VM ${DR_VM_ID} running ${UPTIME_S}s outside test window — stopping forcefully"
ZFS_REPLICA=$(zfs list -t snapshot 2>/dev/null \
| awk '/vm-109-disk-1@/ {print $1}' | tail -1)
ZFS_REPLICA=${ZFS_REPLICA:-unknown}  # no match yields empty output, not a failure, so || would never fire
cat <<EOF | mail -s "[CRITICAL] VM 109 watchdog: forced stop on $(hostname)" root 2>/dev/null || true
VM ${DR_VM_ID} (oracle-dr-windows) was running for ${UPTIME_S}s outside the
weekly DR test window (Saturday 05:55-07:30) on $(hostname).
This indicates the DR test script either crashed without invoking its
cleanup trap, or someone started VM ${DR_VM_ID} manually without setting
${DEBUG_FLAG}.
The watchdog is force-stopping VM ${DR_VM_ID} now to prevent another
04-20-style memory exhaustion if HA failover were to fire while VM 109
is still holding its memory.
Latest VM 109 ZFS replica: ${ZFS_REPLICA}
Watchdog log: ${LOG}
To run a manual test without watchdog interference:
touch ${DEBUG_FLAG}
qm start ${DR_VM_ID}
# ... your work ...
qm stop ${DR_VM_ID}
rm ${DEBUG_FLAG}
EOF
qm stop "$DR_VM_ID" --skiplock --timeout 60 2>>"$LOG" || \
log "qm stop failed for VM ${DR_VM_ID}"
log "Force stop completed"

View File: weekly-dr-test-proxmox.sh

@@ -22,12 +22,16 @@
set -euo pipefail
# Cleanup trap: ensure VM 109 is always stopped on script exit
# Cleanup trap: stop VM 109 on script exit ONLY if this script started it.
# Fixes incident 2026-04-20: script crashed at SSH step and left VM 109 running
# for 2.5 days, causing OOM cascade on pveelite after pvemini HA failover.
# Guard prevents the trap from killing an externally-running VM during
# --install / --help or when an operator launched it manually for debugging.
DR_VM_STARTED_BY_US=false
cleanup_vm() {
local rc=$?
if qm status "${DR_VM_ID:-109}" 2>/dev/null | grep -q running; then
if [ "$DR_VM_STARTED_BY_US" = "true" ] \
&& qm status "${DR_VM_ID:-109}" 2>/dev/null | grep -q running; then
echo "[trap] VM ${DR_VM_ID:-109} still running at exit (rc=$rc), forcing stop"
qm stop "${DR_VM_ID:-109}" --skiplock 2>/dev/null || true
fi
@@ -40,6 +44,14 @@ export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Configuration
DR_VM_ID="109"
# Cluster-aware exit: only the node currently hosting VM 109 should run the
# test. With cron deployed on both pveelite (normal home) and pvemini (DR
# failover home), this guard ensures only one instance fires.
if [ ! -f "/etc/pve/qemu-server/${DR_VM_ID}.conf" ] && [ "${1:-}" != "--install" ] && [ "${1:-}" != "--help" ]; then
exit 0
fi
DR_VM_IP="10.0.20.37"
DR_VM_PORT="22122"
DR_VM_USER="romfast"
@@ -354,20 +366,53 @@ run_dr_test() {
local step_start=$(date +%s)
log "STEP 1: Pre-flight checks"
# Check backups exist
backup_count=$(find "$BACKUP_PATH" -maxdepth 1 -type f -name '*.BKP' 2>/dev/null | wc -l)
# Check 1a: Cluster quorate and not degraded.
# Refusing to test during a node outage prevents stacking VM 109 (6 GB)
# on top of a host already absorbing failover load — the 04-20 trigger.
local cluster_quorate
cluster_quorate=$(pvecm status 2>/dev/null | awk '/Quorate:/ {print $2}')
if [ "$cluster_quorate" != "Yes" ]; then
track_step "Pre-flight checks" false "Cluster not quorate (degraded?)" "$step_start"
test_result="FAILED - Cluster degraded"
backup_count=0
fi
if [ "$backup_count" -lt 2 ]; then
# Check 1b: Memory headroom on this host. Calculated from VM 109 config
# so it scales automatically if VM 109 memory is later resized.
local dr_vm_mem_mb avail_mb min_free_mb
dr_vm_mem_mb=$(qm config "$DR_VM_ID" 2>/dev/null | awk '/^memory:/ {print $2}')
avail_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)
min_free_mb=$((dr_vm_mem_mb + 1024))
if [ "$test_result" != "FAILED - Cluster degraded" ] \
&& [ "$avail_mb" -lt "$min_free_mb" ]; then
track_step "Pre-flight checks" false \
"Insufficient memory: ${avail_mb}MB available, need ${min_free_mb}MB" "$step_start"
test_result="FAILED - Insufficient memory"
backup_count=0
fi
# Check 1c: Backups exist. Runs only when checks 1a/1b left test_result
# at its bare "FAILED" initial value (no specific failure recorded).
if [ "$test_result" = "FAILED" ]; then
backup_count=$(find "$BACKUP_PATH" -maxdepth 1 -type f -name '*.BKP' 2>/dev/null | wc -l)
fi
if [ "$test_result" != "FAILED" ]; then
: # already failed in cluster/memory check, skip
elif [ "$backup_count" -lt 2 ]; then
track_step "Pre-flight checks" false "Insufficient backups (found: $backup_count)" "$step_start"
test_result="FAILED - No backups"
else
track_step "Pre-flight checks" true "Found $backup_count backups" "$step_start"
track_step "Pre-flight checks" true \
"Found $backup_count backups, ${avail_mb}MB available" "$step_start"
# Step 2: Start VM
step_start=$(date +%s)
log "STEP 2: Starting DR VM"
if qm start "$DR_VM_ID" 2>/dev/null; then
local qm_start_output
if qm_start_output=$(qm start "$DR_VM_ID" 2>&1); then
DR_VM_STARTED_BY_US=true
vm_status_label="Running"
# Intelligent VM boot wait with polling (max 180s)
@@ -526,7 +571,8 @@ run_dr_test() {
vm_status_label="Stopped"
else
track_step "VM Startup" false "Failed to start VM $DR_VM_ID" "$step_start"
log_error "qm start $DR_VM_ID failed: $qm_start_output"
track_step "VM Startup" false "Failed to start VM $DR_VM_ID: $qm_start_output" "$step_start"
vm_status_label="Failed to start"
fi
fi

View File: zfs-replicate-oracle-backups.sh

@@ -0,0 +1,73 @@
#!/bin/bash
#
# Replicate rpool/oracle-backups from pveelite (active NFS server) to
# pvemini (standby) every 15 minutes via incremental zfs send/recv.
#
# Why: NFS storage on pveelite is the single point that the DR test and
# the daily SCP transfers from primary Oracle Windows depend on. With
# 15-min ZFS replicas, pvemini can take over within minutes if pveelite
# becomes unreachable (run /opt/scripts/failover-dr-to-pvemini.sh).
#
# Why not pvesr or pve-zsync:
# * pvesr only replicates VM/CT disks, not arbitrary datasets.
# * pve-zsync would add a package dependency for one job. zfs send
# over SSH is the simplest mechanism that fits the rest of the
# cluster's replication patterns.
#
# Schedule: */15 * * * * via cron on pveelite.
# Initial sync:
# zfs send rpool/oracle-backups@init_<ts> | ssh root@<pvemini> \
# 'zfs recv -F rpool/oracle-backups && zfs set readonly=on rpool/oracle-backups'
# ssh root@<pvemini> 'zfs set mountpoint=/mnt/pve/oracle-backups rpool/oracle-backups'
set -euo pipefail
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
DATASET="rpool/oracle-backups"
TARGET_HOST="10.0.20.201" # pvemini direct IP (avoids tailscale magicDNS detour)
SNAP_PREFIX="repl"
KEEP_SNAPS=5 # rolling history on source side
LOCK="/var/run/zfs-replicate-oracle-backups.lock"
LOG="/var/log/oracle-dr/replication.log"
SSH_OPTS="-o UserKnownHostsFile=/etc/pve/priv/known_hosts -o StrictHostKeyChecking=no -o BatchMode=yes"
mkdir -p "$(dirname "$LOG")"
exec 9>"$LOCK"
flock -n 9 || { echo "[$(date)] previous run still active, skipping" >>"$LOG"; exit 0; }
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >>"$LOG"; }
NEW_SNAP="${DATASET}@${SNAP_PREFIX}_$(date +%Y%m%d_%H%M%S)"
zfs snapshot "$NEW_SNAP"
# Find previous replication snapshot (excluding the one we just made)
PREV_SNAP=$(zfs list -t snapshot -o name -s creation "$DATASET" 2>/dev/null \
| awk -v p="${DATASET}@${SNAP_PREFIX}_" '$0 ~ p' \
| grep -v "$NEW_SNAP" \
| tail -1 || true)
if [ -n "$PREV_SNAP" ]; then
log "Incremental send: $PREV_SNAP -> $NEW_SNAP"
if ! zfs send -i "$PREV_SNAP" "$NEW_SNAP" | \
ssh $SSH_OPTS root@${TARGET_HOST} "zfs recv -F $DATASET" 2>>"$LOG"; then
log "ERROR: incremental send failed"
zfs destroy "$NEW_SNAP" 2>/dev/null || true
exit 1
fi
else
log "Full send (no previous snapshot found): $NEW_SNAP"
if ! zfs send "$NEW_SNAP" | \
ssh $SSH_OPTS root@${TARGET_HOST} "zfs recv -F $DATASET" 2>>"$LOG"; then
log "ERROR: full send failed"
zfs destroy "$NEW_SNAP" 2>/dev/null || true
exit 1
fi
fi
# Prune old snapshots on source (keep last KEEP_SNAPS)
zfs list -t snapshot -o name -s creation "$DATASET" \
| awk -v p="${DATASET}@${SNAP_PREFIX}_" '$0 ~ p' \
| head -n -${KEEP_SNAPS} \
| xargs -r -n1 zfs destroy 2>>"$LOG" || true
log "Replication completed successfully"