ROMFASTSQL

Author	SHA1	Message	Date
Claude Agent	62e9926bd4	feat(dr): add cluster + memory pre-flight, deploy VM 109 watchdog. The DR test script now refuses to start VM 109 if (1) the cluster is not quorate (e.g. mid-failover into a degraded state) or (2) available memory on the host is below VM 109's configured memory plus a 1 GB margin. Both checks scale automatically: the memory threshold is computed from qm config, so resizing VM 109 does not require touching the script. Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The watchdog is the second line of defence behind the cleanup trap from `8a0c557`: it force-stops VM 109 if the trap was bypassed (script killed, host crash mid-test, manual run forgotten). It honours /var/run/vm109-debug.flag for legitimate manual sessions and is node-aware via /etc/pve/qemu-server/109.conf, so it can be deployed on every node without coordinating with VM 109's current location. Both safeguards target the 04-18 → 04-20 chain: VM 109 was left running for 2.5 days, then sandwiched against an HA failover that pushed CT 108 Oracle (8 GB) onto pveelite (16 GB), causing an OOM cascade. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 18:48:12 +00:00
Claude Agent	2e8cd9ca59	fix(dr-test): guard cleanup trap + surface qm start errors. The cleanup trap added in `8a0c557` stopped VM 109 unconditionally on EXIT, which killed the VM during --install/--help runs or when an operator had launched it manually for debugging. Gate the trap with DR_VM_STARTED_BY_US so it only fires when the script itself started the VM. Also remove the 2>/dev/null swallow on qm start so cross-node failures (e.g. running on a node where the VM is not configured) appear in the log instead of producing a silent "Failed to start VM 109" in 0 seconds. Root cause of the 2026-04-25 silent failure: the cron job lived on pveelite while VM 109 had been migrated to pvemini, so qm start returned an error that was hidden by the redirect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 08:47:54 +00:00
Claude Agent	1203c24d63	docs(proxmox): document HA, corosync tuning, diagnostic tools and mail relay. Following the 2026-04-20 cluster outage, the cluster README now covers HA resource limits, corosync token tuning (10s tolerance for USB glitches), the rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite. The VM 109 README now clearly states that it was removed from HA and is only started by the weekly DR test script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 11:30:46 +00:00
Claude Agent	60c27e7232	fix(vm109-dr): trap cleanup to stop VM 109 on script exit. The DR test script used set -euo pipefail, so a failing SSH shutdown command caused the script to exit before reaching qm stop. On 2026-04-20 this left VM 109 running for 2.5 days and triggered an OOM cascade when pvemini HA-failed over to pveelite. Adds an EXIT trap that force-stops VM 109 regardless of exit path, and makes the Step 7 SSH shutdown tolerant of failure. Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 11:16:04 +00:00
Marius	a567f75f25	Reorganize oracle/ and chatbot/ into proxmox/ per LXC/VM structure: move oracle/migration-scripts/ to proxmox/lxc108-oracle/migration/; move oracle/roa/ and oracle/roa-romconstruct/ to proxmox/lxc108-oracle/sql/; move oracle/standby-server-scripts/ to proxmox/vm109-windows-dr/; move chatbot/ to proxmox/lxc104-flowise/; update proxmox/README.md with new structure and navigation; update all documentation with correct directory references; remove unused input/claude-agent-sdk/ files. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-27 17:28:53 +02:00
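
The guarded cleanup trap described in the two fix commits above can be sketched roughly as follows. This is a minimal illustration, not the repository's script: only DR_VM_STARTED_BY_US, the EXIT trap, and the qm start/stop calls come from the commit messages, while run_qm, start_vm, and DR_DRY_RUN are hypothetical names introduced here. The sketch defaults to a dry run so it can be read and executed on a machine without Proxmox.

```shell
#!/bin/sh
set -eu

VMID=109
DR_VM_STARTED_BY_US=0          # flag named in the commit message
DR_DRY_RUN="${DR_DRY_RUN:-1}"  # hypothetical: 1 = only print the qm command

# Hypothetical wrapper around qm so the sketch can run without Proxmox.
run_qm() {
    if [ "$DR_DRY_RUN" = "1" ]; then
        echo "[dry-run] qm $*"
    else
        qm "$@"
    fi
}

# Stop the VM on exit, but only if this script was the one that started it.
cleanup() {
    if [ "$DR_VM_STARTED_BY_US" = "1" ]; then
        run_qm stop "$VMID" || true
    fi
}
trap cleanup EXIT

# Start the VM without redirecting stderr, so a cross-node failure stays visible.
start_vm() {
    run_qm start "$VMID"
    DR_VM_STARTED_BY_US=1
}
```

Because the flag is set only after qm start succeeds, paths such as --install or --help, which never call start_vm, leave a manually started VM untouched when the trap fires.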

5 Commits
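
The pre-flight checks from the newest commit (quorum, plus free memory of at least the VM's configured memory and a 1 GB margin) might look something like the sketch below. The function names and MARGIN_MIB are hypothetical. Reading MemAvailable from /proc/meminfo and matching the "Quorate:" line of pvecm status are assumptions about how the real script measures these values; the commit message only says that the threshold is derived from qm config.

```shell
#!/bin/sh
MARGIN_MIB=1024   # the 1 GB margin from the commit message, expressed in MiB

# Read `qm config <vmid>` output on stdin and print the "memory:" value (MiB).
vm_memory_mib() {
    awk -F': *' '$1 == "memory" { print $2; exit }'
}

# Read /proc/meminfo-style text on stdin and print MemAvailable in MiB.
host_available_mib() {
    awk '$1 == "MemAvailable:" { print int($2 / 1024); exit }'
}

# memory_ok AVAILABLE_MIB VM_MIB: succeed if the host can fit the VM plus margin.
memory_ok() {
    [ "$1" -ge $(( $2 + MARGIN_MIB )) ]
}

# Read `pvecm status` output on stdin and succeed if the cluster is quorate.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# On a Proxmox node the pieces might be combined like this (not run here):
#   pvecm status | is_quorate || exit 1
#   memory_ok "$(host_available_mib < /proc/meminfo)" \
#             "$(qm config 109 | vm_memory_mib)" || exit 1
```

Keeping the parsers as stdin filters means the threshold follows whatever qm config reports, which matches the commit's point that resizing VM 109 needs no script change.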