Adds an end-to-end procedure for moving production back from DR (10.0.20.37)
to a repaired/reinstalled PRIMARY (10.0.20.36): final RMAN backup on DR
in restricted/read-only mode, RMAN restore on PRIMARY, app connection
switch, scheduled-task reactivation, VM 109 stop. Companion PowerShell
script handles the restore with sanity checks (IP, NFS, backup freshness)
and aborts if Oracle major version != 19, since failback to 21c would
need an extra dictionary upgrade step (~30-60 min) that adds untested
risk during the critical window — recommended path is 19c failback then
upgrade later in a planned window.
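The version gate could look roughly like this. A sketch only: the real companion script is PowerShell, the variable names are illustrative, and the banner is hardcoded here where the script would read it from the PRIMARY instance (e.g. via sqlplus -v):

```shell
# Illustrative major-version gate; abort unless PRIMARY runs Oracle 19c.
# The banner string is a stand-in for live sqlplus output.
banner="SQL*Plus: Release 19.0.0.0.0 - Production"
major=$(printf '%s\n' "$banner" | grep -oE '[0-9]+' | head -1)
if [ "$major" != "19" ]; then
    echo "ABORT: Oracle major version $major on PRIMARY; failback is validated for 19c only" >&2
    exit 1
fi
echo "version gate passed: 19c"
```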
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM 109 returned to its original home on pveelite, co-located with
oracle-backups NFS storage. The README is updated to reflect that:
the VM is now in HA (ha-prefer-pveelite, state=stopped, nofailback=1)
rather than excluded from HA, and the new layered defences (trap
guard, watchdog cron, dynamic memory pre-flight, max_restart caps)
are documented alongside the original 8a0c557 trap.
Adds a Storage Failover section describing the pveelite -> pvemini
manual failover flow: email alert from pveelite-down-alert.sh,
failover-dr-to-pvemini.sh on the surviving node, failback when
pveelite returns. The pve1 nightly mirror is the third copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote
the rpool/oracle-backups dataset between nodes when pveelite is down.
Both refuse to run if the other side is reachable to prevent split-brain.
Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via
SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201.
The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash
caller does not need to escape PowerShell quoting. An end-to-end
failover -> failback test confirmed that transfer_backups.ps1 returned
to a byte-identical state (SHA256 43DD2187...).
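The encoding step can be sketched as below. The command text and the Windows path are simplified stand-ins for the real patch, and the replace direction is illustrative; only the UTF-16LE + base64 mechanics are the point:

```shell
# Build a PowerShell -EncodedCommand payload from bash: encode the
# command as UTF-16LE, then base64. No quote escaping needed.
ps_cmd='(Get-Content C:\scripts\transfer_backups.ps1) -replace "10.0.20.202","10.0.20.201"'
encoded=$(printf '%s' "$ps_cmd" | iconv -f UTF-8 -t UTF-16LE | base64 -w0)
# The failover script would then run, over SSH to the Oracle host:
#   powershell -NoProfile -EncodedCommand $encoded
```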
pveelite-down-alert.sh runs every minute on pvemini and emails an alert
with copy-paste failover instructions after 5 consecutive ping failures.
The alert body includes the latest oracle-backups and VM 109 replica
timestamps so the operator knows the recovery point before deciding.
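The 5-strike logic has roughly this shape (function and file names are mine, and the real alert body also carries the replica timestamps; the check takes the ping exit status so it can be exercised without a network):

```shell
# Consecutive-failure counter: alert exactly once, on the 5th miss.
STATE=$(mktemp)
echo 0 > "$STATE"
check() {   # $1 = exit status of: ping -c1 -W2 <pveelite address>
    if [ "$1" -eq 0 ]; then
        echo 0 > "$STATE"                      # reachable: reset strikes
    else
        n=$(( $(cat "$STATE") + 1 ))
        echo "$n" > "$STATE"
        if [ "$n" -eq 5 ]; then
            echo "ALERT: pveelite down 5x; run failover-dr-to-pvemini.sh"
        fi
    fi
}
```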
The DR weekly-test script gains a cluster-aware guard at the top that
exits silently when /etc/pve/qemu-server/109.conf is not on the local
node, allowing the same cron entry to be present on both pveelite and
pvemini without double-firing.
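The guard works because /etc/pve/qemu-server/<vmid>.conf is only visible on the node currently hosting the VM. A sketch, parameterised so it can be tested off-cluster (the function name is mine):

```shell
# Locality guard: in production PVE_QEMU_DIR is /etc/pve/qemu-server.
PVE_QEMU_DIR=${PVE_QEMU_DIR:-/etc/pve/qemu-server}
vm_is_local() {
    [ -e "$PVE_QEMU_DIR/$1.conf" ]
}
# At the top of the weekly DR test script:
#   vm_is_local 109 || exit 0   # wrong node: exit silently, cron stays quiet
```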
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert /mnt/pve/oracle-backups from a directory on the pveelite
rootfs into a dedicated ZFS dataset rpool/oracle-backups so it can be
incrementally replicated to pvemini. zfs-replicate-oracle-backups.sh
runs every 15 minutes from cron on pveelite and uses zfs send/recv
over the cluster's internal SSH (direct IP, /etc/pve/priv/known_hosts)
to avoid the Tailscale MagicDNS detours that broke the first attempt.
The destination dataset is set readonly=on so accidental writes on
pvemini cannot diverge it. Snapshot pruning keeps 5 rolling copies.
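One replication cycle has roughly this shape; dataset and SSH details are from the commit, the snapshot naming is assumed, and only the prune-selection helper is runnable here since zfs itself needs the cluster:

```shell
# Sketch of a cycle in zfs-replicate-oracle-backups.sh:
#   snap="rpool/oracle-backups@repl-$(date +%Y%m%d%H%M)"
#   zfs snapshot "$snap"
#   zfs send -i "$prev" "$snap" \
#     | ssh -o UserKnownHostsFile=/etc/pve/priv/known_hosts root@<pvemini> \
#         zfs recv rpool/oracle-backups
#
# Prune selection: stdin is snapshot names oldest-first (as from
# `zfs list -t snapshot -s creation -H`); stdout is what to destroy,
# keeping the 5 newest.
prune_candidates() {
    head -n -5
}
```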
nightly-backup-mirror.sh ships a third copy nightly to pve1's
backup-ssd (ext4 SATA) — different physical disk, different
filesystem, different node — guarding against the failure mode where
both pveelite and pvemini are simultaneously unavailable. The same
script tars /etc/pve and rotates 14 days of cluster config archives,
since pmxcfs is in-RAM and a multi-node quorum loss would otherwise
take cluster config with it.
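The tar-and-rotate step, demonstrated against scratch directories; in nightly-backup-mirror.sh the source is /etc/pve and the destination is pve1's backup-ssd (the paths below are stand-ins):

```shell
# Archive the pmxcfs tree (in-RAM, so a quorum loss takes it with it)
# and rotate 14 days of archives.
SRC=$(mktemp -d); echo "demo" > "$SRC/corosync.conf"
DEST=$(mktemp -d)
tar -czf "$DEST/pve-config-$(date +%F).tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
find "$DEST" -name 'pve-config-*.tar.gz' -mtime +14 -delete
```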
The old directory is kept as oracle-backups.old-DELETE-AFTER-2026-05-02
on pveelite for one week as a safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DR test script now refuses to start VM 109 if:
* cluster is not quorate (e.g. mid-failover into a degraded state),
* available memory on the host is below VM 109 config + 1 GB margin.
The memory threshold is computed at run time from qm config, so
resizing VM 109 does not require touching the script.
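The memory pre-flight reduces to a pure comparison, shown here without qm (in the script the VM size would come from qm config 109 and available memory from /proc/meminfo; the helper name is mine):

```shell
# Dynamic memory pre-flight: VM RAM (from qm config) plus a fixed
# 1 GB margin must fit in MemAvailable.
mem_ok() {   # $1 = VM memory (MiB), $2 = MemAvailable (MiB)
    [ "$2" -ge $(( $1 + 1024 )) ]
}
# In the script:
#   mem_ok "$vm_mb" "$avail_mb" || { echo "refusing to start VM 109"; exit 1; }
```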
Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The
watchdog is the second line of defence behind the cleanup trap from
8a0c557: it force-stops VM 109 if the trap was bypassed (script
killed, host crash mid-test, manual run forgotten). It honours
/var/run/vm109-debug.flag for legitimate manual sessions and is
node-aware via /etc/pve/qemu-server/109.conf so it can be deployed
on every node without coordinating with VM 109's current location.
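The decision path is roughly as follows (flag and conf paths are from the commit; how the real script distinguishes a legitimate in-progress DR test from a leftover VM is not reproduced here):

```shell
# vm109-watchdog.sh skeleton: node-aware, debug-flag aware, force-stop.
CONF=${CONF:-/etc/pve/qemu-server/109.conf}
FLAG=${FLAG:-/var/run/vm109-debug.flag}
watchdog() {
    [ -e "$CONF" ] || return 0          # VM is on another node: not ours
    [ -e "$FLAG" ] && return 0          # operator flagged a manual session
    if qm status 109 2>/dev/null | grep -q running; then
        qm stop 109                     # trap was bypassed: force-stop
    fi
}
watchdog
```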
Both safeguards target the 2026-04-18 → 2026-04-20 failure chain: VM 109
was left running for 2.5 days, then an HA failover pushed CT 108 Oracle
(8 GB) onto pveelite (16 GB), triggering an OOM cascade.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup trap added in 8a0c557 stopped VM 109 unconditionally on EXIT,
which kills the VM during --install/--help or when an operator launched
it manually for debugging. Gate the trap with DR_VM_STARTED_BY_US so it
only fires when the script itself started the VM.
Also remove the 2>/dev/null swallow on qm start so cross-node failures
(e.g. running on a node where the VM is not configured) appear in the
log instead of producing a silent "Failed to start VM 109" in 0 seconds.
Root cause for the 2026-04-25 silent failure: cron lived on pveelite
while VM 109 had been migrated to pvemini; qm start returned an error
that was hidden by the redirect.
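The gate, in outline (the trap body is a stand-in here so it can run anywhere; in the script it calls qm stop 109):

```shell
# Only stop VM 109 on EXIT if this run actually started it.
DR_VM_STARTED_BY_US=0
cleanup() {
    if [ "$DR_VM_STARTED_BY_US" = 1 ]; then
        qm stop 109
    fi
}
trap cleanup EXIT
# At the start step, with no 2>/dev/null, the cross-node error from
# qm start reaches the log, and the flag is set only on success:
#   qm start 109 && DR_VM_STARTED_BY_US=1
```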
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the 2026-04-20 cluster outage, the cluster README now covers
HA resource limits, corosync token tuning (10s tolerance for USB glitches),
rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via
mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite.
VM 109 README now clearly states it was removed from HA and is only
started by the weekly DR test script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DR test script used set -euo pipefail, so a failing SSH
shutdown command caused the script to exit before qm stop.
On 2026-04-20 this left VM 109 running for 2.5 days and
triggered an OOM cascade when pvemini HA-failed over to
pveelite.
Adds EXIT trap that force-stops VM 109 regardless of exit
path, and makes the Step 7 SSH shutdown tolerant of failure.
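The failure mode and the fix in miniature: under set -euo pipefail a failing command aborts the script before the stop step, so the guest shutdown must tolerate failure and an EXIT trap must own the stop. Here `false` stands in for the SSH guest shutdown and the trap's echo for qm stop 109:

```shell
set -euo pipefail
trap 'echo "EXIT trap: qm stop 109 (force)"' EXIT
# Step 7, made failure-tolerant: without `|| ...` this line would
# terminate the script and qm stop would never run.
false || echo "guest shutdown failed; EXIT trap will still stop the VM"
echo "reached end of script"
```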
Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move oracle/migration-scripts/ to proxmox/lxc108-oracle/migration/
- Move oracle/roa/ and oracle/roa-romconstruct/ to proxmox/lxc108-oracle/sql/
- Move oracle/standby-server-scripts/ to proxmox/vm109-windows-dr/
- Move chatbot/ to proxmox/lxc104-flowise/
- Update proxmox/README.md with new structure and navigation
- Update all documentation with correct directory references
- Remove unused input/claude-agent-sdk/ files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>