- Proxmox default-matcher acum trimite doar la mail-to-root (pve1 smtp
eliminat din matcher → fix emailuri duble pentru backup/vzdump)
- Adaugat tabel cron jobs per nod cu motivul redirect-ului > /dev/null
- Regula: scripturi cu propriile notificari trebuie sa aiba redirect in
crontab, altfel cron genereaza email suplimentar de confirmare
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documenteaza incidentul Kingston SNV3S2000G hang la 2026-04-30 (Sensor 2
74°C → emergency mode + restart loop) si masurile aplicate: distantare
temporala backup-uri par/impar, mutare CT 101+110 pe pve1 backup-ssd,
nofail in fstab, hardware watchdog iTCO_wdt, monitoring CSV la 30 min.
Adauga scripturile /opt/scripts/kingston-thermal-{monitor,report}.sh
pentru tracking trend si alertare la depasirea pragurilor termale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM 201 (Windows critical) stays out of HA by design. Added:
- failover-vm201.sh: interactive failover pvemini -> pveelite with ZFS replication state
- recover-vm201-to-pvemini.sh: interactive reverse migration with uptime + split-brain checks
- pvemini-down-alert.sh: cron watchdog on pveelite, emails full runbook after 2min DOWN
Replication RPO tightened: CT 108 + VM 201 to 5min, CT 171 to 15min.
CT 171 added to HA (ha-group-main) for continuous Claude Code access.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the 2026-04-20 cluster outage, the cluster README now covers
HA resource limits, corosync token tuning (10s tolerance for USB glitches),
rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via
mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite.
VM 109 README now clearly states it was removed from HA and is only
started by the weekly DR test script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DR test script used set -euo pipefail, so a failing SSH
shutdown command caused the script to exit before qm stop.
On 2026-04-20 this left VM 109 running for 2.5 days and
triggered an OOM cascade when pvemini HA-failed over to
pveelite.
Adds EXIT trap that force-stops VM 109 regardless of exit
path, and makes the Step 7 SSH shutdown tolerant of failure.
Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Create cluster/ for Proxmox cluster infrastructure (SSH guide, HA monitor, UPS)
- Create lxc108-oracle/ for Oracle Database documentation and scripts
- Create vm201-windows/ for Windows 11 VM docs and SSL certificate scripts
- Add SSL certificate monitoring scripts (check-ssl-certificates.ps1, monitor-ssl-certificates.sh)
- Remove archived VM107 references (decommissioned)
- Update all cross-references between files
- Update main README.md with new structure and navigation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>