Adds an end-to-end procedure for moving production back from DR (10.0.20.37)
to a repaired/reinstalled PRIMARY (10.0.20.36): final RMAN backup on DR
in restricted/read-only mode, RMAN restore on PRIMARY, app connection
switch, scheduled-task reactivation, VM 109 stop. Companion PowerShell
script handles the restore with sanity checks (IP, NFS, backup freshness)
and aborts if Oracle major version != 19, since failback to 21c would
need an extra dictionary upgrade step (~30-60 min) that adds untested
risk during the critical window — recommended path is 19c failback then
upgrade later in a planned window.
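The version gate could look roughly like this. A sketch only: the real companion script is PowerShell, the variable names are illustrative, and the banner is hardcoded here where the script would read it from the PRIMARY instance (e.g. via sqlplus -v):

```shell
# Illustrative major-version gate; abort unless PRIMARY runs Oracle 19c.
# The banner string is a stand-in for live sqlplus output.
banner="SQL*Plus: Release 19.0.0.0.0 - Production"
major=$(printf '%s\n' "$banner" | grep -oE '[0-9]+' | head -1)
if [ "$major" != "19" ]; then
    echo "ABORT: Oracle major version $major on PRIMARY; failback is validated for 19c only" >&2
    exit 1
fi
echo "version gate passed: 19c"
```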
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM 109 returned to its original home on pveelite, co-located with
oracle-backups NFS storage. The README is updated to reflect that:
the VM is now in HA (ha-prefer-pveelite, state=stopped, nofailback=1)
rather than excluded from HA, and the new layered defences (trap
guard, watchdog cron, dynamic memory pre-flight, max_restart caps)
are documented alongside the original 8a0c557 trap.
Adds a Storage Failover section describing the pveelite -> pvemini
manual failover flow: email alert from pveelite-down-alert.sh,
failover-dr-to-pvemini.sh on the surviving node, failback when
pveelite returns. The pve1 nightly mirror is the third copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote
the rpool/oracle-backups dataset between nodes when pveelite is down.
Both refuse to run if the other side is reachable to prevent split-brain.
Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via
SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201.
The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash
caller does not need to escape PowerShell quoting. An end-to-end
failover -> failback test confirmed that transfer_backups.ps1 returned
to a byte-identical state (SHA256 43DD2187...).
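The encoding step can be sketched as below. The command text and the Windows path are simplified stand-ins for the real patch, and the replace direction is illustrative; only the UTF-16LE + base64 mechanics are the point:

```shell
# Build a PowerShell -EncodedCommand payload from bash: encode the
# command as UTF-16LE, then base64. No quote escaping needed.
ps_cmd='(Get-Content C:\scripts\transfer_backups.ps1) -replace "10.0.20.202","10.0.20.201"'
encoded=$(printf '%s' "$ps_cmd" | iconv -f UTF-8 -t UTF-16LE | base64 -w0)
# The failover script would then run, over SSH to the Oracle host:
#   powershell -NoProfile -EncodedCommand $encoded
```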
pveelite-down-alert.sh runs every minute on pvemini and emails an alert
with copy-paste failover instructions after 5 consecutive ping failures.
The alert body includes the latest oracle-backups and VM 109 replica
timestamps so the operator knows the recovery point before deciding.
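The 5-strike logic has roughly this shape (function and file names are mine, and the real alert body also carries the replica timestamps; the check takes the ping exit status so it can be exercised without a network):

```shell
# Consecutive-failure counter: alert exactly once, on the 5th miss.
STATE=$(mktemp)
echo 0 > "$STATE"
check() {   # $1 = exit status of: ping -c1 -W2 <pveelite address>
    if [ "$1" -eq 0 ]; then
        echo 0 > "$STATE"                      # reachable: reset strikes
    else
        n=$(( $(cat "$STATE") + 1 ))
        echo "$n" > "$STATE"
        if [ "$n" -eq 5 ]; then
            echo "ALERT: pveelite down 5x; run failover-dr-to-pvemini.sh"
        fi
    fi
}
```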
The DR weekly-test script gains a cluster-aware guard at the top that
exits silently when /etc/pve/qemu-server/109.conf is not on the local
node, allowing the same cron entry to be present on both pveelite and
pvemini without double-firing.
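The guard works because /etc/pve/qemu-server/<vmid>.conf is only visible on the node currently hosting the VM. A sketch, parameterised so it can be tested off-cluster (the function name is mine):

```shell
# Locality guard: in production PVE_QEMU_DIR is /etc/pve/qemu-server.
PVE_QEMU_DIR=${PVE_QEMU_DIR:-/etc/pve/qemu-server}
vm_is_local() {
    [ -e "$PVE_QEMU_DIR/$1.conf" ]
}
# At the top of the weekly DR test script:
#   vm_is_local 109 || exit 0   # wrong node: exit silently, cron stays quiet
```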
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert /mnt/pve/oracle-backups from a directory on the pveelite
rootfs into a dedicated ZFS dataset rpool/oracle-backups so it can be
incrementally replicated to pvemini. zfs-replicate-oracle-backups.sh
runs every 15 minutes from cron on pveelite and uses zfs send/recv
over the cluster's internal SSH (direct IP, /etc/pve/priv/known_hosts)
to avoid the Tailscale MagicDNS detours that broke the first attempt.
The destination dataset is set readonly=on so accidental writes on
pvemini cannot diverge it. Snapshot pruning keeps 5 rolling copies.
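One replication cycle has roughly this shape; dataset and SSH details are from the commit, the snapshot naming is assumed, and only the prune-selection helper is runnable here since zfs itself needs the cluster:

```shell
# Sketch of a cycle in zfs-replicate-oracle-backups.sh:
#   snap="rpool/oracle-backups@repl-$(date +%Y%m%d%H%M)"
#   zfs snapshot "$snap"
#   zfs send -i "$prev" "$snap" \
#     | ssh -o UserKnownHostsFile=/etc/pve/priv/known_hosts root@<pvemini> \
#         zfs recv rpool/oracle-backups
#
# Prune selection: stdin is snapshot names oldest-first (as from
# `zfs list -t snapshot -s creation -H`); stdout is what to destroy,
# keeping the 5 newest.
prune_candidates() {
    head -n -5
}
```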
nightly-backup-mirror.sh ships a third copy nightly to pve1's
backup-ssd (ext4 SATA) — different physical disk, different
filesystem, different node — guarding against the failure mode where
both pveelite and pvemini are simultaneously unavailable. The same
script tars /etc/pve and rotates 14 days of cluster config archives,
since pmxcfs is in-RAM and a multi-node quorum loss would otherwise
take cluster config with it.
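The tar-and-rotate step, demonstrated against scratch directories; in nightly-backup-mirror.sh the source is /etc/pve and the destination is pve1's backup-ssd (the paths below are stand-ins):

```shell
# Archive the pmxcfs tree (in-RAM, so a quorum loss takes it with it)
# and rotate 14 days of archives.
SRC=$(mktemp -d); echo "demo" > "$SRC/corosync.conf"
DEST=$(mktemp -d)
tar -czf "$DEST/pve-config-$(date +%F).tar.gz" -C "$(dirname "$SRC")" "$(basename "$SRC")"
find "$DEST" -name 'pve-config-*.tar.gz' -mtime +14 -delete
```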
The old directory is kept as oracle-backups.old-DELETE-AFTER-2026-05-02
on pveelite for one week as a safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DR test script now refuses to start VM 109 if:
* cluster is not quorate (e.g. mid-failover into a degraded state),
* available memory on the host is below VM 109 config + 1 GB margin.
The memory threshold is computed at run time from qm config, so
resizing VM 109 does not require touching the script.
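The memory pre-flight reduces to a pure comparison, shown here without qm (in the script the VM size would come from qm config 109 and available memory from /proc/meminfo; the helper name is mine):

```shell
# Dynamic memory pre-flight: VM RAM (from qm config) plus a fixed
# 1 GB margin must fit in MemAvailable.
mem_ok() {   # $1 = VM memory (MiB), $2 = MemAvailable (MiB)
    [ "$2" -ge $(( $1 + 1024 )) ]
}
# In the script:
#   mem_ok "$vm_mb" "$avail_mb" || { echo "refusing to start VM 109"; exit 1; }
```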
Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The
watchdog is the second line of defence behind the cleanup trap from
8a0c557: it force-stops VM 109 if the trap was bypassed (script
killed, host crash mid-test, manual run forgotten). It honours
/var/run/vm109-debug.flag for legitimate manual sessions and is
node-aware via /etc/pve/qemu-server/109.conf so it can be deployed
on every node without coordinating with VM 109's current location.
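The decision path is roughly as follows (flag and conf paths are from the commit; how the real script distinguishes a legitimate in-progress DR test from a leftover VM is not reproduced here):

```shell
# vm109-watchdog.sh skeleton: node-aware, debug-flag aware, force-stop.
CONF=${CONF:-/etc/pve/qemu-server/109.conf}
FLAG=${FLAG:-/var/run/vm109-debug.flag}
watchdog() {
    [ -e "$CONF" ] || return 0          # VM is on another node: not ours
    [ -e "$FLAG" ] && return 0          # operator flagged a manual session
    if qm status 109 2>/dev/null | grep -q running; then
        qm stop 109                     # trap was bypassed: force-stop
    fi
}
watchdog
```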
Both safeguards target the 2026-04-18 → 2026-04-20 failure chain: VM 109
was left running for 2.5 days, then an HA failover pushed CT 108 Oracle
(8 GB) onto pveelite (16 GB), triggering an OOM cascade.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup trap added in 8a0c557 stopped VM 109 unconditionally on EXIT,
which kills the VM during --install/--help or when an operator launched
it manually for debugging. Gate the trap with DR_VM_STARTED_BY_US so it
only fires when the script itself started the VM.
Also remove the 2>/dev/null swallow on qm start so cross-node failures
(e.g. running on a node where the VM is not configured) appear in the
log instead of producing a silent "Failed to start VM 109" in 0 seconds.
Root cause for the 2026-04-25 silent failure: cron lived on pveelite
while VM 109 had been migrated to pvemini; qm start returned an error
that was hidden by the redirect.
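The gate, in outline (the trap body is a stand-in here so it can run anywhere; in the script it calls qm stop 109):

```shell
# Only stop VM 109 on EXIT if this run actually started it.
DR_VM_STARTED_BY_US=0
cleanup() {
    if [ "$DR_VM_STARTED_BY_US" = 1 ]; then
        qm stop 109
    fi
}
trap cleanup EXIT
# At the start step, with no 2>/dev/null, the cross-node error from
# qm start reaches the log, and the flag is set only on success:
#   qm start 109 && DR_VM_STARTED_BY_US=1
```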
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the 2026-04-20 cluster outage, the cluster README now covers
HA resource limits, corosync token tuning (10s tolerance for USB glitches),
rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via
mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite.
VM 109 README now clearly states it was removed from HA and is only
started by the weekly DR test script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DR test script used set -euo pipefail, so a failing SSH
shutdown command caused the script to exit before qm stop.
On 2026-04-20 this left VM 109 running for 2.5 days and
triggered an OOM cascade when pvemini HA-failed over to
pveelite.
Adds EXIT trap that force-stops VM 109 regardless of exit
path, and makes the Step 7 SSH shutdown tolerant of failure.
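The failure mode and the fix in miniature: under set -euo pipefail a failing command aborts the script before the stop step, so the guest shutdown must tolerate failure and an EXIT trap must own the stop. Here `false` stands in for the SSH guest shutdown and the trap's echo for qm stop 109:

```shell
set -euo pipefail
trap 'echo "EXIT trap: qm stop 109 (force)"' EXIT
# Step 7, made failure-tolerant: without `|| ...` this line would
# terminate the script and qm stop would never run.
false || echo "guest shutdown failed; EXIT trap will still stop the VM"
echo "reached end of script"
```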
Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Move oracle/migration-scripts/ to proxmox/lxc108-oracle/migration/
- Move oracle/roa/ and oracle/roa-romconstruct/ to proxmox/lxc108-oracle/sql/
- Move oracle/standby-server-scripts/ to proxmox/vm109-windows-dr/
- Move chatbot/ to proxmox/lxc104-flowise/
- Update proxmox/README.md with new structure and navigation
- Update all documentation with correct directory references
- Remove unused input/claude-agent-sdk/ files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>