8 Commits

Author SHA1 Message Date
Marius
132b4fb34b Proxmox HA: Fix false FAILED alerts and suppress cron notification emails
Fixed two critical issues with HA monitoring:
1. False positive quorum errors - corosync-quorumtool not in cron PATH
2. Unwanted cron emails from PVE::Notify INFO messages to STDERR

Changes:
- Set proper PATH including /usr/sbin for corosync-quorumtool
- Split notification code: verbose shows all, non-verbose redirects STDERR to /dev/null
- Prevents cron from sending duplicate notification emails
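
A minimal sketch of the two fixes described above, assuming a bash script run from cron. The `run_notify` wrapper and the `VERBOSE` variable are hypothetical names used for illustration; the actual monitoring script may structure this differently.

```shell
#!/usr/bin/env bash
# Cron usually starts jobs with a minimal PATH, so tools that live in
# /usr/sbin (such as corosync-quorumtool) are not found by name.
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Hypothetical switch: 1 shows everything, anything else hides STDERR.
VERBOSE="${VERBOSE:-0}"

# Hypothetical wrapper: run a notification command and drop its STDERR
# unless verbose mode is on. Cron mails any output a job produces, so
# silencing INFO lines on STDERR avoids the extra email.
run_notify() {
    if [ "$VERBOSE" = "1" ]; then
        "$@"
    else
        "$@" 2>/dev/null
    fi
}
```

STDOUT is left untouched in both modes, so only the STDERR noise is dropped in the non-verbose case.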

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 13:35:43 +03:00
Marius
d5bfc6b5c7 Add Oracle DR standby server scripts and Proxmox troubleshooting docs
- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 13:37:33 +03:00
Marius
95f76d7ffb Fix: Add LXC container shutdown to UPS emergency shutdown script
The ups-shutdown-cluster.sh script was missing LXC container shutdown
functionality, only shutting down VMs. This could leave containers
running during UPS power failure, causing ungraceful shutdown.

Changes:
- Added Step 2: LXC container shutdown on all cluster nodes
- Uses 'pct list' to find running containers
- Shuts down each container with 60s timeout
- Parallel shutdown with '&' for speed
- Both local (pvemini) and remote nodes (pve1, pveelite)
- Updated step numbers (now 6 steps total vs 5 before)
- Fixed log_message() to use dynamic timestamp
- Fixed node name comment (pve2 → pveelite)
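
The container step listed above could look roughly like the following for the local node. This is a sketch, not the contents of ups-shutdown-cluster.sh: the function names are hypothetical, and the final `wait` is an addition made here so the function returns only after all background shutdowns finish.

```shell
# Print the IDs of running containers from `pct list` output on stdin.
# The first line of that output is a header (VMID, Status, Lock, Name).
running_ctids() {
    awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Shut down every running local container in parallel, 60s timeout each.
shutdown_local_containers() {
    local ctid
    for ctid in $(pct list | running_ctids); do
        pct shutdown "$ctid" --timeout 60 &   # '&' runs shutdowns in parallel
    done
    wait   # addition in this sketch: block until all background jobs return
}
```

For the remote nodes, the same commands would presumably be run over SSH on pve1 and pveelite; the exact invocation is not shown in the commit message.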

Shutdown order:
1. VMs on all nodes (timeout 60s)
2. LXC containers on all nodes (timeout 60s) [NEW]
3. Wait 90 seconds for graceful shutdown
4. Secondary nodes shutdown (pve1, pveelite)
5. Wait 30 seconds
6. Primary node shutdown (pvemini)

This matches the behavior in ups-maintenance-shutdown.sh which already
had LXC support.
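
The log_message() fix mentioned in the changes above can be illustrated as follows. The shape of the original bug is an assumption (a timestamp captured once at script start); the commit message only says the timestamp is now dynamic.

```shell
# Assumed shape of the bug: a timestamp computed once at startup, e.g.
#   TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
# would stamp every later log line with the same time.
log_message() {
    # date runs on every call, so each line carries the current time.
    printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}
```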

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:49:28 +03:00
Marius
cc72a5f96e Add UPS battery replacement procedure and maintenance shutdown script
Adds complete procedure for replacing UPS battery when entire cluster
is powered by the same UPS, requiring full cluster shutdown.

New files:
- scripts/ups-maintenance-shutdown.sh: Automated orchestrated shutdown
  for maintenance operations with confirmation prompts and progress display
- docs/UPS-BATTERY-REPLACEMENT.md: Complete step-by-step guide for battery
  replacement including pre-shutdown, physical replacement, and post-startup
  verification procedures

Features:
- Orchestrated shutdown: VMs → LXC containers → secondary nodes → primary
- Interactive confirmation before shutdown
- Color-coded progress indicators
- Countdown timers for each phase
- Post-replacement verification checklist
- Troubleshooting guide for common issues
- Recovery procedures for cluster/quorum problems

The procedure accounts for all 3 cluster nodes (pve1, pvemini, pveelite)
being on the same UPS, requiring complete infrastructure shutdown.

Documentation includes:
- When to replace battery (based on monthly test results)
- Pre-planning and user notification templates
- Physical battery replacement safety procedures
- Cluster recovery and VM restart procedures
- Post-replacement testing and verification
- 24-hour and 1-week monitoring checklists

Estimated maintenance window: 30-60 minutes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:46:28 +03:00
Marius
87b9709a0d Add complete UPS monitoring system with monthly battery testing
This commit adds a comprehensive UPS monitoring and management system for
the Proxmox cluster with automated shutdown orchestration and monthly
battery health testing.

Features:
- NUT (Network UPS Tools) configuration for INNO TECH USB UPS
- Automated cluster shutdown on power failure (3-minute grace period)
- Monthly automated battery testing with health evaluation
- Email notifications via PVE::Notify system
- WinNUT monitoring client for Windows VM 201

Components added:
- config/: NUT configuration files (ups.conf, upsd.conf, upsmon.conf, etc.)
- scripts/ups-shutdown-cluster.sh: Orchestrated cluster shutdown
- scripts/ups-monthly-test.sh: Monthly battery test with email reports
- scripts/upssched-cmd: Event handler for UPS state changes
- docs/: Complete installation and usage documentation

Key findings:
- UPS battery.charge reporting has 10-40 second delay after test start
- Test must monitor voltage drop (1.5-2V) and charge drop (9-27%)
- Battery health evaluation: EXCELLENT/GOOD/FAIR/POOR based on discharge rate
- Email notifications use Handlebars templates without Unicode emojis for compatibility
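
Because battery.charge lags behind the test start, a test script has to poll rather than read the value once. The sketch below shows one way to do that; `read_charge` is a hypothetical hook, and the timeout and interval defaults are placeholders, not values from ups-monthly-test.sh.

```shell
# read_charge is a hypothetical hook supplied by the caller. On a NUT host
# it might wrap something like:
#   upsc <upsname>@localhost battery.charge
# where <upsname> is whatever name is configured in ups.conf.

# Poll until the reported charge falls below the starting value.
# Args: start charge, timeout in seconds (default 60), poll interval (default 5).
# Prints the first lower reading and returns 0; returns 1 on timeout.
wait_for_charge_drop() {
    local start="$1" timeout="${2:-60}" interval="${3:-5}" waited=0 now
    while [ "$waited" -le "$timeout" ]; do
        now=$(read_charge)
        now=${now%.*}   # drop any decimal part so the integer test works
        if [ "$now" -lt "$start" ]; then
            echo "$now"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}
```

A 60-second default timeout comfortably covers the 10-40 second delay noted above.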

Configuration:
- UPS: INNO TECH (Voltronic protocol, vendor 0665:5161)
- Primary node: pvemini (10.0.20.201) with USB connection
- Monthly test: cron 0 0 1 * * /opt/scripts/ups-monthly-test.sh
- Shutdown timer: 180 seconds on battery before cluster shutdown

Documentation includes complete installation guides for NUT server,
WinNUT client, and troubleshooting procedures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:39:46 +03:00
Marius
f3fca1f96e Update Proxmox HA monitoring script - remove qdevice support
Changes:
- Remove qdevice verification (qdevice no longer exists in cluster)
- Fix cluster nodes detection (updated pvecm status output format)
- Add --help parameter with complete usage documentation
- Update notification templates (remove qdevice references)
- Simplify quorum check (only verify total_votes = expected_votes)

The script now correctly monitors:
- HA Services (pve-ha-lrm, pve-ha-crm)
- Cluster Quorum (3/3 votes)
- Online nodes (3 nodes detected via Membership information)
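
The simplified quorum check (total votes equal to expected votes) can be sketched as a filter over `pvecm status` output. The function name `quorum_ok` is hypothetical, and the sketch assumes the output contains "Expected votes:" and "Total votes:" lines as in the Votequorum section.

```shell
# Read `pvecm status` output on stdin and succeed only when the
# "Total votes" value equals the "Expected votes" value.
quorum_ok() {
    awk -F': *' '
        /^Expected votes:/ { expected = $2 + 0 }
        /^Total votes:/    { total = $2 + 0 }
        END { exit !(expected > 0 && total == expected) }
    '
}

# Possible usage on a cluster node:
#   pvecm status | quorum_ok || echo "quorum check FAILED"
```

If either line is missing from the input, `expected` stays at 0 and the check fails, which errs on the side of raising an alert.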

Tested successfully on pvemini.romfast.ro (10.0.20.201)
Status: SUCCESSFUL with all checks passing

Also updated proxmox-ssh-guide.md with current cluster configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 18:48:05 +03:00
Marius
b414b3c338 vm 107 monitor
2025-09-30 02:12:25 +03:00
Marius
24c8c75eb6 proxmox monitori
2025-09-30 02:06:34 +03:00