Add Oracle DR standby server scripts and Proxmox troubleshooting docs

- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 13:37:33 +03:00
parent 95f76d7ffb
commit d5bfc6b5c7
19 changed files with 6766 additions and 0 deletions
--- a/proxmox/troubleshooting-vm201-backup-nfs.md
+++ b/proxmox/troubleshooting-vm201-backup-nfs.md
@@ -0,0 +1,394 @@
+# Troubleshooting: VM 201 Locked & Backup-NFS Unknown
+
+**Date:** 2025-10-08
+**Affected nodes:** pvemini (10.0.20.201)
+**Affected resources:** VM 201 (roacentral), Storage backup-nfs
+
+---
+
+## Problem 1: VM 201 - Status Running but HA Error
+
+### Symptoms
+- VM 201 (Windows 11) was frozen
+- The Proxmox GUI showed **running**, but with an **HA error**
+- Reboot/stop attempts from the GUI failed
+- The VM did not respond to commands
+
+### Diagnosis
+
+#### 1. Check VM status
+```bash
+ssh root@10.0.20.201 "qm status 201"
+# Output: status: running
+```
+
+#### 2. Check HA status
+```bash
+ssh root@10.0.20.201 "ha-manager status"
+# Output: service vm:201 (pvemini, error)
+```
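When several services are under HA management, the status output can be filtered for the ones that need manual intervention. The following is a minimal sketch, assuming each service line has the same `service <id> (<node>, <state>)` shape as the output shown above; the function name `ha_services_in_error` is hypothetical and not part of Proxmox.

```bash
#!/usr/bin/env bash
# Read `ha-manager status` output on stdin and print the IDs (e.g. vm:201)
# of all services whose reported state is "error".
# Assumes lines shaped like: service vm:201 (pvemini, error)
ha_services_in_error() {
  awk '$1 == "service" && $NF == "error)" { print $2 }'
}
```

Usage, under the same assumption: `ssh root@10.0.20.201 "ha-manager status" | ha_services_in_error`.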
+
+#### 3. Check VM configuration
+```bash
+ssh root@10.0.20.201 "qm config 201"
+# Output: lock: backup
+```
+
+**Root cause identified:** A backup lock left active in the configuration, probably from an interrupted backup job.
+
+#### 4. Check HA logs
+```bash
+ssh root@10.0.20.201 "journalctl -u pve-ha-lrm --since '5 minutes ago' --no-pager | grep -i '201\|error'"
+```
+
+**Relevant output:**
+```
+Oct 08 11:18:46 pvemini pve-ha-lrm: can't lock file '/var/lock/qemu-server/lock-201.conf' - got timeout
+Oct 08 11:18:46 pvemini pve-ha-lrm: unable to stop service vm:201 (still running)
+Oct 08 11:18:56 pvemini pve-ha-lrm: service vm:201 is in an error state and needs manual intervention
+```
+
+#### 5. Check the KVM process
+```bash
+ssh root@10.0.20.201 "ps aux | grep 'qm\|kvm' | grep 201"
+```
+**Result:** The KVM process (PID 3628) had been running for 2 days but was frozen.
+
+### Resolving VM 201
+
+#### Step 1: Unlock the VM
+```bash
+ssh root@10.0.20.201 "qm unlock 201"
+# Success - the lock was removed
+```
+
+#### Step 2: Attempt a normal stop (failed)
+```bash
+ssh root@10.0.20.201 "qm stop 201"
+# Error: service 'vm:201' in error state, must be disabled and fixed first
+```
+
+#### Step 3: Remove the VM from HA management
+```bash
+ssh root@10.0.20.201 "ha-manager remove vm:201"
+# Success - VM removed from HA
+```
+
+#### Step 4: Force stop the VM
+```bash
+ssh root@10.0.20.201 "qm stop 201"
+# Output: VM quit/powerdown failed - terminating now with SIGTERM
+#         VM still running - terminating now with SIGKILL
+```
+
+#### Step 5: Verify the VM stopped
+```bash
+ssh root@10.0.20.201 "qm status 201"
+# Output: status: stopped
+```
+
+#### Step 6: Start the VM
+```bash
+ssh root@10.0.20.201 "qm start 201"
+ssh root@10.0.20.201 "sleep 5 && qm status 201"
+# Output: status: running
+```
+
+#### Step 7: Re-add to HA
+```bash
+ssh root@10.0.20.201 "ha-manager add vm:201"
+ssh root@10.0.20.201 "sleep 10 && ha-manager status | grep 201"
+# Output: service vm:201 (pvemini, started)
+```
+
+**Result:** ✅ VM 201 is functional and back under HA management
+
+---
+
+## Problem 2: Storage backup-nfs - Status Unknown
+
+### Symptoms
+- Storage backup-nfs appeared as **unknown** in the GUI
+- Every command that accessed `/mnt/pve/backup-nfs` hung
+- SSH operations on pvemini timed out
+- The NFS mount existed but was hung
+
+### Diagnosis
+
+#### 1. Check storage status
+```bash
+ssh root@10.0.20.201 "pvesm status | grep backup"
+```
+
+**Output:**
+```
+backup             dir     active      1921724696       287855936      1536176700   14.98%
+backup-nfs         nfs   inactive               0               0               0    0.00%
+backup-ssd         dir   disabled               0               0               0      N/A
+got timeout
+unable to activate storage 'backup-nfs' - directory '/mnt/pve/backup-nfs' does not exist or is unreachable
+```
+
+#### 2. Check storage configuration
+```bash
+ssh root@10.0.20.201 "cat /etc/pve/storage.cfg | grep -A5 backup-nfs"
+```
+
+**Output:**
+```
+nfs: backup-nfs
+	export /mnt/backup
+	path /mnt/pve/backup-nfs
+	server 10.0.20.201
+	content rootdir,snippets,images,iso,import,vztmpl,backup
+```
+
+#### 3. Check the mount point (TIMEOUT)
+```bash
+ssh root@10.0.20.201 "ls -ld /mnt/pve/backup-nfs"
+# Timed out after 2 minutes - NFS completely hung
+```
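Because any access to a hung mount can block the shell for minutes, later checks can be wrapped in a hard time limit. This is a minimal sketch, assuming GNU coreutils `timeout` and `stat` are available on the node; the function name `probe_path` and the 5-second default are illustrative choices, not something taken from the incident.

```bash
#!/usr/bin/env bash
# Probe a path with a hard time limit so a hung NFS mount cannot block the shell.
# Usage: probe_path <path> [seconds]
# Returns 0 if stat succeeds within the limit, non-zero on error or timeout.
probe_path() {
  local path="$1" limit="${2:-5}"
  # SIGKILL is sent on timeout because a process waiting on a hung mount
  # may not react to gentler signals.
  timeout --signal=KILL "$limit" stat -- "$path" >/dev/null 2>&1
}
```

For example, `probe_path /mnt/pve/backup-nfs 5 || echo "mount not responding"` would return after about 5 seconds at most, assuming the kernel still allows the blocked process to be killed.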
+
+#### 4. Check whether it is mounted
+```bash
+ssh root@10.0.20.201 "mount | grep backup-nfs"
+```
+
+**Output:**
+```
+10.0.20.201:/mnt/backup on /mnt/pve/backup-nfs type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.20.201,local_lock=none,addr=10.0.20.201)
+```
+
+**Root cause identified:** The NFS server on pvemini was hung; the mount existed but was completely unresponsive.
+
+#### 5. Check NFS service status
+```bash
+ssh root@10.0.20.201 "systemctl status nfs-server"
+# Active: active (exited) - but non-functional
+```
+
+#### 6. Remediation attempts (all failed, most with a timeout)
+```bash
+# Attempt a forced unmount
+ssh root@10.0.20.201 "umount -f /mnt/pve/backup-nfs"
+# device is busy
+
+# Attempt to restart the NFS services
+ssh root@10.0.20.201 "systemctl restart nfs-server"
+# Timed out after 30s
+
+# Attempt to kill the NFS processes
+ssh root@10.0.20.201 "pkill -9 nfs"
+# Timed out after 15s
+```
+
+### Resolving Backup-NFS
+
+#### Step 1: Disable the storage from another node
+```bash
+ssh root@10.0.20.200 "pvesm set backup-nfs --disable 1"
+ssh root@10.0.20.200 "pvesm status | grep backup"
+```
+
+**Output:**
+```
+backup             dir   disabled
+backup-nfs         nfs   disabled
+backup-ssd         dir     active
+```
+
+#### Step 2: Force reboot pvemini
+```bash
+# Attempt a normal reboot (hung)
+ssh root@10.0.20.201 "reboot" &
+# Did not work
+
+# Force reboot via sysrq
+ssh root@10.0.20.201 "echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger" &
+# Output: "System is going down" - SUCCESS
+```
+
+#### Step 3: Monitor the reboot
+```bash
+for i in {1..60}; do
+  sleep 2
+  ping -c 1 -W 1 10.0.20.201 >/dev/null 2>&1 && echo "pvemini is back online!" && break || echo "Waiting... ($i/60)"
+done
+# Output: pvemini is back online! (after ~6 seconds)
+```
+
+#### Step 4: Verify after reboot
+```bash
+# Wait for the Proxmox services
+sleep 15
+
+# Check storage status
+ssh root@10.0.20.201 "pvesm status | grep backup-nfs"
+# Output: backup-nfs         nfs   disabled
+```
+
+#### Step 5: Re-enable the storage
+```bash
+ssh root@10.0.20.201 "pvesm set backup-nfs --disable 0"
+ssh root@10.0.20.201 "pvesm status | grep backup"
+```
+
+**Output:**
+```
+backup             dir     active      1921724696       287855936      1536176700   14.98%
+backup-nfs         nfs   inactive               0               0               0    0.00%
+```
+
+#### Step 6: Check the mount
+```bash
+ssh root@10.0.20.201 "mount | grep backup-nfs"
+```
+
+**Output:**
+```
+10.0.20.201:/mnt/backup on /mnt/pve/backup-nfs type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576)
+```
+
+#### Step 7: Check accessibility
+```bash
+ssh root@10.0.20.201 "df -h /mnt/pve/backup-nfs"
+```
+
+**Output:**
+```
+Filesystem               Size  Used Avail Use% Mounted on
+10.0.20.201:/mnt/backup  1.8T  275G  1.5T  16% /mnt/pve/backup-nfs
+```
+
+#### Step 8: Restart pvestatd to refresh the status
+```bash
+ssh root@10.0.20.201 "systemctl restart pvestatd"
+ssh root@10.0.20.201 "sleep 5 && pvesm status | grep backup-nfs"
+```
+
+**Final output:**
+```
+backup-nfs         nfs     active      1921725440       287856640      1536177152   14.98%
+```
+
+**Result:** ✅ Storage backup-nfs is functional and active
+
+---
+
+## Additional Observations
+
+### VMs/LXCs did not start automatically after reboot
+Although the containers and VMs with `onboot: 1` did not start immediately after the forced reboot, they recovered automatically once:
+- Cluster quorum was re-established (3/3 nodes)
+- The HA manager recovered its state
+- The storages became available
+
+HA was conservative after the forced reboot, waiting for confirmation of cluster stability before starting services.
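Since guest startup depended on quorum returning, a post-reboot check can wait for quorum before expecting VMs and containers to be up. The sketch below assumes that `pvecm status` prints a `Quorate:` field with the value `Yes` or `No`; the function names `is_quorate` and `wait_for_quorum` and the retry values are hypothetical.

```bash
#!/usr/bin/env bash
# Read `pvecm status` output on stdin; succeed only if it reports "Quorate: Yes".
is_quorate() {
  awk -F: '
    $1 ~ /^Quorate/ { gsub(/[ \t]/, "", $2); found = ($2 == "Yes") }
    END { exit (found ? 0 : 1) }
  '
}

# Poll until the cluster is quorate or the number of tries is exhausted.
# Usage: wait_for_quorum [tries]   (2 seconds between tries)
wait_for_quorum() {
  local tries="${1:-30}" i
  for ((i = 1; i <= tries; i++)); do
    pvecm status 2>/dev/null | is_quorate && return 0
    sleep 2
  done
  return 1
}
```

On a cluster node, `wait_for_quorum 60 && ha-manager status` would then print the HA state only after quorum is reported.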
+
+---
+
+## Lessons Learned
+
+### About VM locks
+1. Backup locks can remain active if backup jobs are interrupted abruptly
+2. `qm unlock <VMID>` clears simple locks
+3. A VM in the HA error state must be removed from HA before any intervention
+
+### About NFS on Proxmox
+1. **Avoid NFS self-mounts** - pvemini mounts NFS from itself (10.0.20.201:/mnt/backup → 10.0.20.201:/mnt/pve/backup-nfs)
+2. This configuration can cause deadlocks when the NFS server or client runs into problems
+3. **Recommendation:** Move the NFS server to a dedicated node or a separate NAS
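Self-mounts like the one described above can be spotted from the `mount` output by comparing the server address with the `clientaddr=` option. This is a rough sketch: it assumes NFSv4 mounts (which carry `clientaddr=`), IPv4 addresses or hostnames without colons, and mount points without spaces; the function name `find_nfs_self_mounts` is hypothetical.

```bash
#!/usr/bin/env bash
# Read `mount` output on stdin and print NFS mounts whose server address equals
# the clientaddr= mount option, which indicates a host mounting its own export.
find_nfs_self_mounts() {
  awk '$5 ~ /^nfs/ {
    server = $1
    sub(/:.*/, "", server)            # "10.0.20.201:/mnt/backup" -> "10.0.20.201"
    client = ""
    n = split($6, opts, /[(),]/)      # split "(rw,...,clientaddr=...)" into options
    for (i = 1; i <= n; i++)
      if (opts[i] ~ /^clientaddr=/) {
        client = opts[i]
        sub(/^clientaddr=/, "", client)
      }
    if (client != "" && server == client)
      print $1 " -> " $3
  }'
}
```

Run as `mount | find_nfs_self_mounts` on each node; any line printed is a candidate for moving the export to another host.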
+
+### Useful Diagnostic Commands
+
+#### Check HA status
+```bash
+ha-manager status              # Full HA overview
+ha-manager config              # HA resource configuration
+cat /etc/pve/ha/resources.cfg  # HA configuration file
+journalctl -u pve-ha-lrm -f    # HA Local Resource Manager logs
+```
+
+#### Check VM locks
+```bash
+qm config <VMID> | grep lock      # Check for a lock in the config
+ls -lh /var/lock/qemu-server/     # Lock files on disk
+qm unlock <VMID>                  # Remove the lock
+qm stop <VMID> --skiplock         # Forced stop, ignoring the lock
+```
+
+#### Check NFS
+```bash
+showmount -e <IP>                    # Available exports
+pvesm nfsscan <IP>                   # NFS scan via Proxmox
+mount | grep nfs                     # Active NFS mounts
+df -h <mount_point>                  # Test mount accessibility
+systemctl status nfs-server          # NFS server status
+systemctl status nfs-client.target   # NFS client status
+```
+
+#### Force reboot when SSH is hung
+```bash
+# Via sysrq (last resort: reboots immediately, without syncing or unmounting filesystems)
+ssh root@<IP> "echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger" &
+
+# Via IPMI/iLO (if available)
+ipmitool -I lanplus -H <IPMI_IP> -U <user> -P <pass> power reset
+```
+
+---
+
+## Prevention
+
+### For VM lock issues
+1. **Monitor backup jobs** - verify that they finish correctly
+2. **Test backup recovery** - run periodic test restores for validation
+3. **Configure adequate timeouts** for large backups
+4. **Enable HA only for critical VMs** - not every VM needs HA
+
+### For NFS storage
+1. **Separate the NFS server from the client** - do not mount NFS from the same host
+2. **Monitor NFS timeouts** in the logs
+3. **Configure a soft mount** instead of a hard mount for non-critical storage (soft mounts return I/O errors instead of hanging, at the risk of data corruption on interrupted writes)
+4. **Periodically test** the accessibility of NFS storages
+
+### Preventive Monitoring
+```bash
+# Check every VM for a leftover lock
+for vm in $(qm list | awk 'NR>1 {print $1}'); do
+  lock=$(qm config "$vm" | grep '^lock:')
+  if [ -n "$lock" ]; then
+    echo "WARNING: VM $vm has $lock"
+  fi
+done
+
+# Check NFS storage health; timeout keeps a hung mount from blocking the loop
+for nfs in $(pvesm status | awk '$2 == "nfs" {print $1}'); do
+  if ! timeout 30 pvesm list "$nfs" &>/dev/null; then
+    echo "ERROR: Storage $nfs not accessible"
+  fi
+done
+```
+
+---
+
+## Summary of Commands Executed
+
+### Resolving VM 201
+```bash
+ssh root@10.0.20.201 "qm unlock 201"
+ssh root@10.0.20.201 "ha-manager remove vm:201"
+ssh root@10.0.20.201 "qm stop 201"
+ssh root@10.0.20.201 "qm start 201"
+ssh root@10.0.20.201 "ha-manager add vm:201"
+```
+
+### Resolving backup-nfs
+```bash
+ssh root@10.0.20.200 "pvesm set backup-nfs --disable 1"
+ssh root@10.0.20.201 "echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger" &
+# Wait for the reboot
+ssh root@10.0.20.201 "pvesm set backup-nfs --disable 0"
+ssh root@10.0.20.201 "systemctl restart pvestatd"
+```
+
+**Total resolution time:** ~15 minutes (including the reboot)