3 Commits

Author SHA1 Message Date
Marius
95f76d7ffb Fix: Add LXC container shutdown to UPS emergency shutdown script
The ups-shutdown-cluster.sh script was missing LXC container shutdown
functionality, only shutting down VMs. This could leave containers
running during UPS power failure, causing ungraceful shutdown.

Changes:
- Added Step 2: LXC container shutdown on all cluster nodes
- Uses 'pct list' to find running containers
- Shuts down each container with 60s timeout
- Parallel shutdown with '&' for speed
- Both local (pvemini) and remote nodes (pve1, pveelite)
- Updated step numbers (now 6 steps total vs 5 before)
- Fixed log_message() to use dynamic timestamp
- Fixed node name comment (pve2 → pveelite)

Shutdown order:
1. VMs on all nodes (timeout 60s)
2. LXC containers on all nodes (timeout 60s) [NEW]
3. Wait 90 seconds for graceful shutdown
4. Secondary nodes shutdown (pve1, pveelite)
5. Wait 30 seconds
6. Primary node shutdown (pvemini)

This matches the behavior in ups-maintenance-shutdown.sh which already
had LXC support.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:49:28 +03:00
Marius
cc72a5f96e Add UPS battery replacement procedure and maintenance shutdown script
Adds complete procedure for replacing UPS battery when entire cluster
is powered by the same UPS, requiring full cluster shutdown.

New files:
- scripts/ups-maintenance-shutdown.sh: Automated orchestrated shutdown
  for maintenance operations with confirmation prompts and progress display
- docs/UPS-BATTERY-REPLACEMENT.md: Complete step-by-step guide for battery
  replacement including pre-shutdown, physical replacement, and post-startup
  verification procedures

Features:
- Orchestrated shutdown: VMs → LXC containers → secondary nodes → primary
- Interactive confirmation before shutdown
- Color-coded progress indicators
- Countdown timers for each phase
- Post-replacement verification checklist
- Troubleshooting guide for common issues
- Recovery procedures for cluster/quorum problems

The procedure accounts for all 3 cluster nodes (pve1, pvemini, pveelite)
being on the same UPS, requiring complete infrastructure shutdown.

Documentation includes:
- When to replace battery (based on monthly test results)
- Pre-planning and user notification templates
- Physical battery replacement safety procedures
- Cluster recovery and VM restart procedures
- Post-replacement testing and verification
- 24-hour and 1-week monitoring checklists

Estimated maintenance window: 30-60 minutes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:46:28 +03:00
Marius
87b9709a0d Add complete UPS monitoring system with monthly battery testing
This commit adds a comprehensive UPS monitoring and management system for
the Proxmox cluster with automated shutdown orchestration and monthly
battery health testing.

Features:
- NUT (Network UPS Tools) configuration for INNO TECH USB UPS
- Automated cluster shutdown on power failure (3-minute grace period)
- Monthly automated battery testing with health evaluation
- Email notifications via PVE::Notify system
- WinNUT monitoring client for Windows VM 201

Components added:
- config/: NUT configuration files (ups.conf, upsd.conf, upsmon.conf, etc.)
- scripts/ups-shutdown-cluster.sh: Orchestrated cluster shutdown
- scripts/ups-monthly-test.sh: Monthly battery test with email reports
- scripts/upssched-cmd: Event handler for UPS state changes
- docs/: Complete installation and usage documentation

Key findings:
- UPS battery.charge reporting has 10-40 second delay after test start
- Test must monitor voltage drop (1.5-2V) and charge drop (9-27%)
- Battery health evaluation: EXCELLENT/GOOD/FAIR/POOR based on discharge rate
- Email notifications use Handlebars templates without Unicode emojis for compatibility

Configuration:
- UPS: INNO TECH (Voltronic protocol, vendor 0665:5161)
- Primary node: pvemini (10.0.20.201) with USB connection
- Monthly test: cron 0 0 1 * * /opt/scripts/ups-monthly-test.sh
- Shutdown timer: 180 seconds on battery before cluster shutdown

Documentation includes complete installation guides for NUT server,
WinNUT client, and troubleshooting procedures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:39:46 +03:00