Document VM 201 power outage incident and update HA configuration

- Add troubleshooting guide for 2026-01-11 power outage incident
- Update vm201-windows11.md with correct storage details (disk-1, disk-3)
- Remove HA configuration, document manual failover procedure
- Add ZFS replication status and commands
- Document lessons learned: ISO attachments block migration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Marius
2026-01-11 15:28:00 +02:00
parent 594b77e449
commit 00c6410dbd
2 changed files with 251 additions and 9 deletions


@@ -24,12 +24,17 @@
### Storage Details
```bash
# Check disk usage
ssh root@10.0.20.201 "qm config 201 | grep scsi"
ssh root@10.0.20.201 "qm config 201 | grep -E 'efidisk|virtio'"
# Output:
# scsi0: local-zfs:vm-201-disk-0,size=500G
# efidisk0: local-zfs:vm-201-disk-1,efitype=4m,pre-enrolled-keys=1,size=528K
# virtio0: local-zfs:vm-201-disk-3,size=500G
```
**Active disks:**
- `vm-201-disk-1` - EFI disk (528K)
- `vm-201-disk-3` - Main Windows disk (500 GB, ~89 GB used)
### Network Configuration
- **Interface:** net0 - virtio bridge=vmbr0
- **IP Assignment:** DHCP (managed by network DHCP server)
@@ -335,6 +340,7 @@ ssh root@10.0.20.201 "qm delsnapshot 201 pre-update-snapshot"
### VM 201-Specific Documentation
- **SSL Certificates IIS:** `vm201-certificat-letsencrypt-iis.md`
- **Troubleshooting Incident 2025-10-08:** `vm201-troubleshooting-backup-nfs.md`
- **Troubleshooting Power Outage 2026-01-11:** `vm201-troubleshooting-pana-curent-2026-01-11.md`
### General Infrastructure Documentation
- **Proxmox Cluster General:** `proxmox-ssh-guide.md`
@@ -367,13 +373,47 @@ ssh root@10.0.20.201 "qm delsnapshot 201 pre-update-snapshot"
- Gitea: 10.0.20.165:3000
- Portainer: 10.0.20.170:9443
### High Availability
- **HA Status:** Enabled (managed by pve-ha-crm)
- **Priority:** Normal
- **Autostart:** Enabled (onboot: 1)
- **Recovery:** Automatic VM migration in case of node failure
### High Availability and Replication
**Troubleshooting HA:** See `vm201-troubleshooting-backup-nfs.md` → "VM 201 - HA Error"
#### HA Status: DISABLED (Manual Control)
- **HA:** Removed from the HA cluster (post-incident decision, 2026-01-11)
- **Reason:** Manual control over failover, to avoid blockages
- **Autostart:** Enabled (onboot: 1) - starts automatically when the node boots
#### ZFS Replication (Active)
- **Job 201-0:** pvemini → pve1 (every 30 min)
- **Job 201-1:** pvemini → pveelite (every 30 min)
```bash
# Check replication status
ssh root@10.0.20.201 "pvesr status | grep 201"
# Force an immediate replication run
ssh root@10.0.20.201 "pvesr schedule-now 201-0 && pvesr schedule-now 201-1"
```
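The status check above could be turned into a small filter that lists only the jobs needing attention. This is a minimal sketch, not part of the original runbook. It assumes the `pvesr status` table starts with a header row that contains `FailCount` and `State` columns, which is an assumption about the output format. `failed_replication_jobs` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `pvesr status` output on stdin and prints the
# job IDs of the given VM whose FailCount is above 0 or whose State is not OK.
# Assumption: the first line is a header naming the FailCount and State columns.
failed_replication_jobs() {
  local vmid="$1"
  awk -v vmid="$vmid" '
    NR == 1 {
      # Remember the position of each column by its header name
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Only rows whose job ID starts with "<vmid>-"
    index($1, vmid "-") == 1 && ($(col["FailCount"]) > 0 || $(col["State"]) != "OK") {
      print $1
    }
  '
}

# Usage (assumed invocation, same host as in the commands above):
#   ssh root@10.0.20.201 "pvesr status" | failed_replication_jobs 201
```

An empty result would mean that no job for that VM reports a failure, under the column assumption stated above.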
#### Manual Failover Procedure (when pvemini goes down)
**Important:** `/etc/pve` is a shared filesystem (pmxcfs). Even if pvemini is offline, the VM configuration remains accessible from any node in the cluster.
```bash
# From pveelite (10.0.20.202) or pve1 (10.0.20.200):
# 1. Move the configuration to the node where you want to start the VM
mv /etc/pve/nodes/pvemini/qemu-server/201.conf /etc/pve/nodes/pveelite/qemu-server/201.conf
# 2. Start the VM (the disks are already replicated)
qm start 201
```
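The config move in step 1 could be wrapped in a guard so that it cannot overwrite an existing file or fail silently. This is a minimal sketch under assumptions: `move_vm_config` and its optional fourth argument (an alternative root directory, useful for a dry run outside the cluster) are hypothetical and not part of the original procedure. Note that pmxcfs is only writable while the remaining nodes still have quorum, so the move is expected to fail otherwise.

```bash
# Hypothetical helper: move a VM config from one node directory to another.
# Arguments: <vmid> <source node> <target node> [root, default /etc/pve/nodes]
move_vm_config() {
  local vmid="$1" src="$2" dst="$3" root="${4:-/etc/pve/nodes}"
  local from="$root/$src/qemu-server/$vmid.conf"
  local to="$root/$dst/qemu-server/$vmid.conf"

  # The source config must exist (the VM is still registered on the failed node)
  if [ ! -f "$from" ]; then
    echo "config not found: $from" >&2
    return 1
  fi
  # Never overwrite a config that already exists on the target node
  if [ -e "$to" ]; then
    echo "refusing to overwrite: $to" >&2
    return 1
  fi
  mv -- "$from" "$to"
}

# Usage on the surviving node (assumed invocation):
#   move_vm_config 201 pvemini pveelite && qm start 201
```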
#### Failback Procedure (migrating back to pvemini)
```bash
# Once pvemini is back online, migrate the VM back
qm migrate 201 pvemini --online
```
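Since the commit message notes that ISO attachments block migration, a pre-check before `qm migrate` may help. This is a minimal sketch: `attached_isos` is a hypothetical helper, and it assumes that CD-ROM drives appear in `qm config` output as lines containing `media=cdrom`, with `none` as the volume when the drive is empty.

```bash
# Hypothetical helper: reads `qm config <vmid>` output on stdin and prints
# the names of CD-ROM drives that still have a volume attached.
# Assumption: empty drives look like "ide2: none,media=cdrom".
attached_isos() {
  awk -F': ' '/media=cdrom/ && $2 !~ /^none(,|$)/ { print $1 }'
}

# Usage (assumed invocation, on the node currently running the VM):
#   qm config 201 | attached_isos
```

If a drive such as `ide2` is reported, ejecting the ISO first, for example with `qm set 201 --ide2 none,media=cdrom`, should clear the blocker before the migration is attempted.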
**Troubleshooting:** See `vm201-troubleshooting-pana-curent-2026-01-11.md`
---
@@ -402,6 +442,6 @@ ssh root@10.0.20.201 "qm delsnapshot 201 pre-update-snapshot"
---
**Last updated:** 2025-11-19
**Last updated:** 2026-01-11
**Author:** Marius Mutu
**Project:** ROMFASTSQL - VM 201 Documentation