Update Proxmox HA monitoring script - remove qdevice support

Changes:
- Remove qdevice verification (qdevice no longer exists in cluster)
- Fix cluster nodes detection (updated pvecm status output format)
- Add --help parameter with complete usage documentation
- Update notification templates (remove qdevice references)
- Simplify quorum check (only verify total_votes = expected_votes)
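
A minimal sketch of the simplified quorum check, assuming `corosync-quorumtool -s` prints `Quorate:`, `Expected votes:` and `Total votes:` lines as the script expects. The function name `quorum_ok` and the sample text are hypothetical and only illustrate the comparison; they are not taken from the script.

```shell
#!/bin/bash
# Hypothetical sample resembling the relevant lines of `corosync-quorumtool -s`.
sample_quorum='Quorate:          Yes
Expected votes:   3
Total votes:      3'

# quorum_ok (hypothetical helper): succeeds only when the cluster is quorate
# and the total vote count equals the expected vote count.
quorum_ok() {
    local info="$1" expected total
    echo "$info" | grep -q "Quorate:.*Yes" || return 1
    expected=$(echo "$info" | awk '/Expected votes:/ {print $3}')
    total=$(echo "$info" | awk '/Total votes:/ {print $3}')
    # Guard against empty values so that a parse failure is not reported as OK.
    [ -n "$total" ] && [ "$total" = "$expected" ]
}

if quorum_ok "$sample_quorum"; then
    echo "Quorum: OK"
else
    echo "Quorum: WARNING"
fi
```

With the qdevice gone, no separate vote source needs to be inspected, so a single equality test on the two counters is enough for this sketch.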

The script now correctly monitors:
- HA Services (pve-ha-lrm, pve-ha-crm)
- Cluster Quorum (3/3 votes)
- Online nodes (3 nodes detected via Membership information)
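
The node detection can be sketched as follows, assuming the "Membership information" table of `pvecm status` lists one line per node starting with a hexadecimal node ID. The sample text, its addresses, and the helper name `count_member_nodes` are hypothetical. The previous check counted lines containing the `A,V,NMW` flags, which presumably stopped appearing once the qdevice was removed.

```shell
#!/bin/bash
# Hypothetical sample resembling the membership table of `pvecm status`.
sample_pvecm='Membership information
----------------------
    Nodeid      Votes Name
0x00000001          1 192.0.2.11
0x00000002          1 192.0.2.12 (local)
0x00000003          1 192.0.2.13'

# count_member_nodes (hypothetical helper): counts lines that begin with a
# hexadecimal node ID, optionally preceded by whitespace.
count_member_nodes() {
    # grep -c prints the number of matching lines (0 when nothing matches).
    printf '%s\n' "$1" | grep -cE '^[[:space:]]*0x[0-9a-fA-F]+'
}

nodes_online=$(count_member_nodes "$sample_pvecm")
echo "Cluster Nodes: $nodes_online nodes online"
```

Matching on the node ID rather than on a flags column keeps the count independent of whether a qdevice is configured.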

Tested successfully on pvemini.romfast.ro (10.0.20.201)
Status: SUCCESSFUL with all checks passing

Also updated proxmox-ssh-guide.md with current cluster configuration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Marius
2025-10-06 18:48:05 +03:00
parent 8795b92887
commit f3fca1f96e
3 changed files with 213 additions and 341 deletions


@@ -1,6 +1,6 @@
#!/bin/bash
# HA Monitor with PVE::Notify - final version
# HA Monitor with PVE::Notify - version without qdevice
# Uses the native Proxmox system with custom templates
#
# TEMPLATE SYSTEM:
@@ -33,22 +33,73 @@ FQDN=$(hostname -f)
DATE=$(date '+%Y-%m-%d %H:%M:%S')
START_TIME=$(date +%s)
# Function for creating the notification templates
create_templates() {
local template_dir="/etc/pve/notification-templates/default"
# Create the directory if it does not exist
mkdir -p "$template_dir"
echo "Creating notification templates in $template_dir..."
# Subject template - for SUCCESS
cat > "$template_dir/ha-status-subject.txt.hbs" << 'EOF'
# Check parameters before execution
if [ "$1" == "--help" ] || [ "$1" == "-h" ]; then
cat << 'HELP'
HA Monitor Script - Proxmox High Availability Monitoring
USAGE:
/opt/scripts/ha-monitor.sh [OPTION]
OPTIONS:
(no option) Run HA check and send notification via Proxmox notification system
-v, --verbose Run HA check with detailed console output
--create-templates Recreate notification templates in /etc/pve/notification-templates/default/
-h, --help Display this help message
DESCRIPTION:
This script monitors the Proxmox HA cluster status and sends notifications
using the native Proxmox notification system (PVE::Notify).
It checks:
- HA Services (pve-ha-lrm, pve-ha-crm)
- Cluster Quorum status
- Number of online cluster nodes
NOTIFICATION TEMPLATES:
Templates are stored in: /etc/pve/notification-templates/default/
- ha-status-subject.txt.hbs (email subject)
- ha-status-body.txt.hbs (email body text)
- ha-status-body.html.hbs (email body HTML)
LOG FILE:
/var/log/pve-ha-monitor.log
EXAMPLES:
# Run normal check (silent, sends notification)
/opt/scripts/ha-monitor.sh
# Run with verbose output
/opt/scripts/ha-monitor.sh -v
# Recreate email templates
/opt/scripts/ha-monitor.sh --create-templates
CRON SETUP:
To run every 5 minutes:
*/5 * * * * /opt/scripts/ha-monitor.sh
HELP
exit 0
fi
if [ "$1" == "--create-templates" ] || [ "$1" == "--templates" ]; then
# Function for creating the notification templates
create_templates() {
local template_dir="/etc/pve/notification-templates/default"
# Create the directory if it does not exist
mkdir -p "$template_dir"
echo "Creating notification templates in $template_dir..."
# Subject template - for SUCCESS
cat > "$template_dir/ha-status-subject.txt.hbs" << 'EOF'
{{#if (eq status "SUCCESSFUL")}}✅ HA CLUSTER OK - {{ hostname }}{{else}}🚨 HA CLUSTER ISSUES - {{ hostname }}{{/if}}
EOF
# Body text template
cat > "$template_dir/ha-status-body.txt.hbs" << 'EOF'
# Body text template
cat > "$template_dir/ha-status-body.txt.hbs" << 'EOF'
{{#if (eq status "SUCCESSFUL")}}✅ HIGH AVAILABILITY STATUS: ALL SYSTEMS OK{{else}}🚨 HIGH AVAILABILITY CLUSTER HAS ISSUES{{/if}}
Host: {{ hostname }}
@@ -58,40 +109,17 @@ CLUSTER STATUS:
{{ details }}
{{#if (eq status "FAILED")}}
=== HOW TO READ pvecm status OUTPUT ===
=== IMMEDIATE ACTIONS REQUIRED ===
Your current problematic output shows:
- Total votes: 2 (WRONG - should be 3)
- Qdevice (votes 0) (WRONG - should be votes 1)
1. SSH to cluster: ssh root@{{ hostname }}
2. Check overall status: pvecm status
3. Review HA logs: journalctl -u pve-ha-lrm -u pve-ha-crm -n 20
4. Check network connectivity between nodes
5. Verify all cluster nodes are online
After fix should show:
- Total votes: 3 (CORRECT)
- Qdevice (votes 1) (CORRECT)
=== STEP-BY-STEP FIX ===
Step 1 - Fix Qdevice (PRIORITY):
systemctl restart corosync-qdevice
sleep 5
corosync-qdevice-tool -s
Step 2 - Verify cluster status:
pvecm status
LOOK FOR: Total votes: 3 (not 2!) and Qdevice (votes 1)
Step 3 - Test HA functionality:
ha-manager status
=== WHAT THIS MEANS ===
QDEVICE DISCONNECTED: No tie-breaker vote
- If one node fails, cluster may lose quorum
- VMs won't automatically migrate
The cluster works now but has no tie-breaker vote.
One node failure = no quorum = VMs can't migrate.
{{else}}
All HA components are functioning normally.
- Cluster has proper quorum with qdevice participation
- Cluster has proper quorum
- Automatic VM migration is available
- System is fully redundant
{{/if}}
@@ -115,8 +143,8 @@ Log file: /var/log/pve-ha-monitor.log
Total check time: {{ runtime }}s
EOF
# HTML body template with a larger, consistent font
cat > "$template_dir/ha-status-body.html.hbs" << 'EOF'
# HTML body template with a larger, consistent font
cat > "$template_dir/ha-status-body.html.hbs" << 'EOF'
<div style="font-family: Arial, sans-serif; font-size: 16px; line-height: 1.5; max-width: 800px;">
{{#if (eq status "SUCCESSFUL")}}
@@ -132,41 +160,22 @@ EOF
<pre style="font-size: 15px; background: #f8f9fa; padding: 12px; border: 1px solid #ddd; border-radius: 4px; margin-bottom: 20px;">{{ details }}</pre>
{{#if (eq status "FAILED")}}
<h3 style="font-size: 18px; margin-top: 20px; margin-bottom: 10px;">HOW TO READ pvecm status OUTPUT</h3>
<p style="font-size: 16px; margin-bottom: 10px;"><strong>Your current problematic output shows:</strong></p>
<ul style="font-size: 16px; margin-bottom: 15px;">
<li>Total votes: 2 <strong style="color: red;">(WRONG - should be 3)</strong></li>
<li>Qdevice (votes 0) <strong style="color: red;">(WRONG - should be votes 1)</strong></li>
</ul>
<h3 style="font-size: 18px; margin-top: 20px; margin-bottom: 10px;">IMMEDIATE ACTIONS REQUIRED</h3>
<p style="font-size: 16px; margin-bottom: 10px;"><strong>After fix should show:</strong></p>
<ul style="font-size: 16px; margin-bottom: 15px;">
<li>Total votes: 3 <strong style="color: green;">(CORRECT)</strong></li>
<li>Qdevice (votes 1) <strong style="color: green;">(CORRECT)</strong></li>
</ul>
<ol style="font-size: 16px; margin-bottom: 15px;">
<li>SSH to cluster: <code>ssh root@{{ hostname }}</code></li>
<li>Check overall status: <code>pvecm status</code></li>
<li>Review HA logs: <code>journalctl -u pve-ha-lrm -u pve-ha-crm -n 20</code></li>
<li>Check network connectivity between nodes</li>
<li>Verify all cluster nodes are online</li>
</ol>
<h3 style="font-size: 18px; margin-top: 20px; margin-bottom: 10px;">STEP-BY-STEP FIX</h3>
<h4 style="font-size: 16px; margin-top: 15px; margin-bottom: 8px;">Step 1 - Fix Qdevice:</h4>
<div style="font-size: 15px; background: #f8f9fa; padding: 12px; border: 1px solid #ddd; border-radius: 4px; margin-bottom: 10px;">
<div style="margin-bottom: 5px;">systemctl restart corosync-qdevice</div>
<div style="margin-bottom: 5px;">sleep 5</div>
<div>corosync-qdevice-tool -s</div>
</div>
<h4 style="font-size: 16px; margin-top: 15px; margin-bottom: 8px;">Step 2 - Verify status:</h4>
<div style="font-size: 15px; background: #f8f9fa; padding: 12px; border: 1px solid #ddd; border-radius: 4px; margin-bottom: 10px;">
<div>pvecm status</div>
</div>
<p style="font-size: 16px; margin-bottom: 15px;"><strong>LOOK FOR:</strong> Total votes: 3 (not 2!) and Qdevice (votes 1)</p>
<p style="font-size: 16px; background: #f8d7da; padding: 12px; border-radius: 4px; margin-top: 15px;"><strong>Bottom line:</strong> The cluster works now but has no tie-breaker vote.<br>
One node failure = no quorum = VMs can't migrate.</p>
<p style="font-size: 16px; background: #f8d7da; padding: 12px; border-radius: 4px; margin-top: 15px;"><strong>Warning:</strong> Issues detected in the cluster. Immediate attention required to ensure high availability.</p>
{{else}}
<p style="font-size: 16px; background: #d4edda; padding: 12px; border-radius: 4px; margin-top: 15px;"><strong>All HA components are functioning normally:</strong></p>
<ul style="font-size: 16px; margin-top: 10px;">
<li>Cluster has proper quorum with qdevice participation</li>
<li>Cluster has proper quorum</li>
<li>Automatic VM migration is available</li>
<li>System is fully redundant</li>
</ul>
@@ -198,12 +207,13 @@ One node failure = no quorum = VMs can't migrate.</p>
</div>
EOF
echo "Templates created successfully."
}
echo "Templates created successfully."
}
# Create the templates on first run or if they do not exist
if [ ! -f "/etc/pve/notification-templates/default/ha-status-subject.txt.hbs" ]; then
create_templates
echo "Templates recreated successfully."
echo "Run './ha-monitor.sh -v' to test with new templates."
exit 0
fi
# HA status check
@@ -220,24 +230,19 @@ check_ha_status() {
status_ok=false
fi
# Check quorum and qdevice
# Check quorum
quorum_info=$(corosync-quorumtool -s 2>/dev/null)
pvecm_info=$(pvecm status 2>/dev/null)
if echo "$quorum_info" | grep -q "Quorate:.*Yes"; then
expected_votes=$(echo "$quorum_info" | grep "Expected votes:" | awk '{print $3}')
total_votes=$(echo "$quorum_info" | grep "Total votes:" | awk '{print $3}')
# Check qdevice via pvecm status - look for the line containing "Qdevice"
qdevice_votes=$(echo "$pvecm_info" | grep -E "^[[:space:]]*0x00000000[[:space:]]+1[[:space:]]+Qdevice" | awk '{print $2}')
if [ "$total_votes" = "$expected_votes" ] && [ "$qdevice_votes" = "1" ]; then
details+="Quorum: OK ($total_votes/$expected_votes votes, Qdevice participating)\n"
elif [ "$total_votes" = "$expected_votes" ]; then
if [ "$total_votes" = "$expected_votes" ]; then
details+="Quorum: OK ($total_votes/$expected_votes votes)\n"
else
details+="Quorum: WARNING ($total_votes/$expected_votes votes)\n"
details+=" Check: pvecm status for qdevice participation\n"
details+=" Check: pvecm status\n"
status_ok=false
fi
else
@@ -246,20 +251,9 @@ check_ha_status() {
status_ok=false
fi
# Check qdevice connectivity
qdevice_status=$(corosync-qdevice-tool -s 2>/dev/null)
if echo "$qdevice_status" | grep -q "State:.*Connected"; then
qnetd_host=$(echo "$qdevice_status" | grep "QNetd host:" | awk '{print $3}')
details+="Qdevice Connection: OK ($qnetd_host)\n"
else
details+="Qdevice Connection: WARNING - Disconnected\n"
details+=" Recovery: systemctl restart corosync-qdevice\n"
status_ok=false
fi
# Check nodes via pvecm status
nodes_online=$(echo "$pvecm_info" | grep -c "A,V,NMW")
# Check nodes via pvecm status - count the lines in Membership information
nodes_online=$(echo "$pvecm_info" | grep -E "^[[:space:]]*0x[0-9a-fA-F]+" | wc -l)
if [ "$nodes_online" -ge 2 ]; then
details+="Cluster Nodes: OK ($nodes_online nodes online)\n"
else
@@ -352,9 +346,4 @@ if [ "$1" == "--verbose" ] || [ "$1" == "-v" ]; then
echo
echo "Using template: ha-status"
echo "Template data: hostname=$FQDN, status=$STATUS, runtime=${RUNTIME}s"
elif [ "$1" == "--create-templates" ] || [ "$1" == "--templates" ]; then
create_templates
echo "Templates recreated successfully."
echo "Run './ha-monitor.sh -v' to test with new templates."
exit 0
fi