VM 109 - Oracle DR System (Windows Standby)
Proxmox directory: proxmox/vm109-windows-dr/
VMID: 109
Role: Disaster Recovery for the Oracle Database (RMAN backups from an external Windows server)
⚠️ Important — VM 109 is NOT in HA (since 2026-04-20)
After the 2026-04-20 incident (see ../cluster/incidents/2026-04-20-cluster-outage.md), VM 109 was removed from HA with `ha-manager remove vm:109`. Reasons:
- VM 109 is a DR test VM, not a live service
- The Saturday DR test script (scripts/weekly-dr-test-proxmox.sh) starts/stops VM 109 itself with `qm start`/`qm stop`
- With HA active, a `set -e` bug in the script left VM 109 running for 2.5 days; when pvemini then crashed, HA relocated VM 109 onto pveelite (16 GB) → OOM cascade
Effects:
- VM 109 is NO longer restarted automatically when a node crashes
- VM 109 is NO longer migrated off pvemini
- VM 109 starts ONLY when the DR script is invoked, or manually with `qm start 109`
- The DR script now has `trap cleanup_vm EXIT`, which guarantees `qm stop 109` on any exit path
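The guarantee the trap provides can be sketched as follows (minimal and illustrative — the `cleanup_vm` body here only echoes; the real implementation in scripts/weekly-dr-test-proxmox.sh runs the actual `qm` commands):

```shell
#!/usr/bin/env bash
# Minimal sketch of the `trap cleanup_vm EXIT` pattern (names from this
# README; the real cleanup runs `qm stop 109` on the Proxmox host).

run_dr_test() (                       # subshell: the trap fires when it exits
  cleanup_vm() { echo "qm stop 109"; }
  trap cleanup_vm EXIT                # runs on success, on failure, on signals
  echo "qm start 109"
  exit 1                              # simulated mid-test failure
  echo "unreached"                    # never executed
)

run_dr_test || true                   # prints "qm start 109" then "qm stop 109"
```

Even though the test body aborts halfway, the EXIT trap still fires, so the VM can never again be left running for days after a failed test.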
Status check:
ssh root@10.0.20.201 "qm status 109"   # should be stopped
ssh root@10.0.20.201 "ha-manager status | grep 109 || echo 'not in HA'"
🛡️ Oracle DR System - Complete Architecture
📊 System Overview
┌─────────────────────────────────────────────────────────────────┐
│ PRODUCTION ENVIRONMENT │
├─────────────────────────────────────────────────────────────────┤
│ PRIMARY SERVER (10.0.20.36) │
│ Windows Server + Oracle 19c │
│ ┌──────────────────────────────┐ │
│ │ Database: ROA │ │
│ │ Size: ~80 GB │ │
│ │ Tables: 42,625 │ │
│ └──────────────────────────────┘ │
│ │ │
│ ▼ Backups (Daily) │
│ ┌──────────────────────────────┐ │
│ │ 02:30 - FULL backup (6-7 GB) │ │
│ │ 13:00 - CUMULATIVE (200 MB) │ │
│ │ 18:00 - CUMULATIVE (300 MB) │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
│ SSH Transfer (Port 22)
▼
┌─────────────────────────────────────────────────────────────────┐
│ DR ENVIRONMENT │
├─────────────────────────────────────────────────────────────────┤
│ PROXMOX HOST (10.0.20.202 - pveelite) │
│ ┌──────────────────────────────┐ │
│ │ Backup Storage (NFS Server) │◄─────── Monitoring Scripts │
│ │ /mnt/pve/oracle-backups/ │ /opt/scripts/ │
│ │ └── ROA/autobackup/ │ │
│ └──────────────────────────────┘ │
│ │ │
│ │ NFS Mount (F:\) │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ DR VM 109 (10.0.20.37) │ │
│ │ Windows Server + Oracle 19c │ │
│ │ Status: OFF (normally) │ │
│ │ Starts for: Tests or Disaster │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
🎯 Quick Actions
⚡ Emergency DR Activation (Production Down!)
# 1. Start DR VM
ssh root@10.0.20.202 "qm start 109"
# 2. Connect to VM (wait 3 min for boot)
ssh -p 22122 romfast@10.0.20.37
# 3. Run restore (takes ~10-15 minutes)
D:\oracle\scripts\rman_restore_from_zero.cmd
# 4. Database is now RUNNING - Update app connections to 10.0.20.37
🧪 Weekly Test (Every Saturday)
# Automatic at 06:00 via cron, or manual:
ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"
# What it does:
# ✓ Starts VM → Restores DB → Tests → Cleanup → Shutdown
# ✓ Sends email report with results
📊 Check Backup Health
# Manual check (runs daily at 09:00 automatically)
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Output:
# Status: OK
# FULL backup age: 11 hours ✓
# CUMULATIVE backup age: 2 hours ✓
# Disk usage: 45% ✓
🗂️ Component Locations
📁 PRIMARY Server (10.0.20.36)
D:\rman_backup\
├── rman_backup_full.txt # RMAN script for FULL backup
├── rman_backup_incremental.txt # RMAN script for CUMULATIVE
└── transfer_backups.ps1 # UNIFIED: Transfer ALL backups to Proxmox
Scheduled Tasks:
├── 02:30 - Oracle RMAN Full Backup
├── 03:00 - Transfer backups to DR (transfer_backups.ps1)
├── 13:00 - Oracle RMAN Cumulative Backup
├── 14:45 - Transfer backups to DR (transfer_backups.ps1)
└── 18:00 - Oracle RMAN Cumulative Backup
📁 PROXMOX Host (10.0.20.202)
/opt/scripts/
├── oracle-backup-monitor-proxmox.sh # Daily backup monitoring
├── weekly-dr-test-proxmox.sh # Weekly DR test
└── PROXMOX_NOTIFICATIONS_README.md # Documentation
/mnt/pve/oracle-backups/ROA/autobackup/
├── FULL_20251010_023001.BKP # Latest FULL backup
├── INCR_20251010_130001.BKP # CUMULATIVE 13:00
└── INCR_20251010_180001.BKP # CUMULATIVE 18:00
Cron Jobs:
0 9 * * * /opt/scripts/oracle-backup-monitor-proxmox.sh
0 6 * * 6 /opt/scripts/weekly-dr-test-proxmox.sh
📁 DR VM 109 (10.0.20.37) - When Running
D:\oracle\scripts\
├── rman_restore_from_zero.cmd # Main restore script ⭐
├── cleanup_database.cmd # Cleanup after test
└── mount-nfs.bat # Mount F:\ at startup
F:\ (NFS mount from Proxmox)
└── ROA\autobackup\ # All backup files
🔄 How It Works
Backup Flow (Daily)
PRIMARY PROXMOX
│ │
├─02:30─FULL─Backup─────────────►
│ (6-7 GB) │
├─03:00─Transfer ALL────────────► Skip duplicates
│ (transfer_backups.ps1) │
│ │
├─13:00─CUMULATIVE──────────────►
│ (200 MB) │
├─14:45─Transfer ALL────────────► Skip duplicates
│ (transfer_backups.ps1) │ (only new files)
│ │
└─18:00─CUMULATIVE──────────────►
(300 MB) Storage
│
┌──────────┐
│ Monitor │ 09:00 Daily
│ Check Age│ Alert if old
└──────────┘
Restore Process
Start VM → Mount F:\ → Copy Backups → RMAN Restore → Database OPEN
2min Auto 2min 8min Ready!
Total Time: ~15 minutes
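The restore flow above can be read as a single driver script (a sketch only — hosts, ports, the VMID and the restore path come from this README; `RUN` and `BOOT_WAIT` are illustrative knobs added so the sequence can be dry-run with `RUN=echo` instead of real SSH):

```shell
#!/usr/bin/env bash
# Sketch of the restore flow: Start VM -> wait for boot (F:\ mounts
# automatically) -> run RMAN restore on the DR VM.
set -u

RUN="${RUN:-ssh}"                # set RUN=echo for a dry run
BOOT_WAIT="${BOOT_WAIT:-180}"    # ~3 min for Windows to boot and mount F:\

activate_dr() {
  $RUN root@10.0.20.202 "qm start 109"               # Start VM
  sleep "$BOOT_WAIT"                                 # wait for boot
  $RUN -p 22122 romfast@10.0.20.37 \
    'D:\oracle\scripts\rman_restore_from_zero.cmd'   # copy backups + restore
}

# activate_dr    # uncomment for a real run; dry run: RUN=echo BOOT_WAIT=0
```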
🔧 Manual Operations
Test Individual Components
# 1. Test backup transfer (on PRIMARY)
powershell -ExecutionPolicy Bypass -File "D:\rman_backup\transfer_backups.ps1"
# 2. Test NFS mount (on VM 109)
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
dir F:\ROA\autobackup
# 3. Test notification system
ssh root@10.0.20.202 "touch -d '2 days ago' /mnt/pve/oracle-backups/ROA/autobackup/*FULL*.BKP"
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Should send WARNING notification
# 4. Test database restore (on VM 109)
D:\oracle\scripts\rman_restore_from_zero.cmd
Force Actions
# Force backup now (on PRIMARY)
rman cmdfile=D:\rman_backup\rman_backup_incremental.txt
# Force cleanup VM (on VM 109)
D:\oracle\scripts\cleanup_database.cmd
# Force VM shutdown
ssh root@10.0.20.202 "qm stop 109"
🐛 Troubleshooting
🔍 Debugging Restore Tests
Check Backup Files on Proxmox (10.0.20.202)
# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lht /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"
# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"
# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"
# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"
# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"
ssh root@10.0.20.202 "du -sh /mnt/pve/oracle-backups/ROA/autobackup/"
# 6. Verify newest backup timestamp
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"
Verify Backup Files on DR VM (when running)
# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup
# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime
# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count
# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)
# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1
# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"
Check DR Test Results
# 1. View latest DR test log
ssh root@10.0.20.202 "ls -lt /var/log/oracle-dr/dr_test_*.log | head -1 | awk '{print \$9}' | xargs cat | tail -100"
# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"
# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"
# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"
# 5. View RMAN errors (if any)
ssh root@10.0.20.202 "grep -i 'RMAN-\|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
Simulate Test Locally (on DR VM)
# 1. Start Oracle service manually
Start-Service OracleServiceROA
# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT
# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode
# 4. Verify database opened correctly
sqlplus / as sysdba @D:\oracle\scripts\verify_db.sql
# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"
# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1
Common Restore Test Issues
| Issue | Check | Fix |
|---|---|---|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: Test-Path F:\ | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |
Verify Naming Convention is Active
# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active
# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
Manual Test Run with Verbose Output
# Run test with full output visible
ssh root@10.0.20.202
cd /opt/scripts
./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log
# Watch in real-time what's happening
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"
❌ Backup Monitor Not Sending Alerts
# 1. Check templates exist
ssh root@10.0.20.202 "ls /usr/share/pve-manager/templates/default/oracle-*"
# 2. Reinstall templates
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh --install"
# 3. Check Proxmox notifications work
ssh root@10.0.20.202 "pvesh create /nodes/$(hostname)/apt/update"
# Should receive update notification
❌ F:\ Drive Not Accessible in VM
# On VM 109:
# 1. Check NFS Client service
Get-Service | Where {$_.Name -like "*NFS*"}
# 2. Manual mount
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
# 3. Check Proxmox NFS server
ssh root@10.0.20.202 "showmount -e localhost"
# Should show: /mnt/pve/oracle-backups 10.0.20.37
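If the export is missing from that output, the export definition on the Proxmox side is the first thing to check. A sketch of what the /etc/exports entry would look like (the option set here is an assumption — verify against the actual file, then re-export with `exportfs -ra`):

```text
# /etc/exports on 10.0.20.202 (sketch; options are assumptions)
/mnt/pve/oracle-backups 10.0.20.37(rw,sync,no_subtree_check)
```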
❌ Restore Fails
# 1. Check backup files exist
dir F:\ROA\autobackup\*.BKP
# 2. Check Oracle service
sc query OracleServiceROA
# 3. Check PFILE exists
dir C:\Users\oracle\admin\ROA\pfile\initROA.ora
# 4. View restore log
type D:\oracle\logs\restore_from_zero.log
❌ VM Won't Start
# Check VM status
ssh root@10.0.20.202 "qm status 109"
# Check VM config
ssh root@10.0.20.202 "qm config 109 | grep -E 'memory|cores|bootdisk'"
# Force unlock if locked
ssh root@10.0.20.202 "qm unlock 109"
# Start with console
ssh root@10.0.20.202 "qm start 109 && qm terminal 109"
📈 Monitoring & Metrics
Key Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| FULL Backup Age | < 24h | > 25h |
| CUMULATIVE Age | < 6h | > 7h |
| Backup Size | ~7 GB/day | > 10 GB |
| Restore Time | < 15 min | > 30 min |
| Disk Usage | < 80% | > 80% |
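The FULL-backup age threshold above can be checked with a few lines of shell. This is a sketch of the idea, not the monitor itself — the path and the 25-hour limit come from this README, but the helper name is illustrative; the real logic lives in /opt/scripts/oracle-backup-monitor-proxmox.sh:

```shell
#!/usr/bin/env bash
# Sketch: alert when the newest FULL/L0 backup piece is older than 25 hours.
set -u

BACKUP_DIR="${BACKUP_DIR:-/mnt/pve/oracle-backups/ROA/autobackup}"
MAX_FULL_AGE_H=25

full_backup_age_hours() {
  # age of the newest FULL_*/L0_* backup piece, in whole hours
  local newest
  newest=$(ls -t "$BACKUP_DIR"/FULL_*.BKP "$BACKUP_DIR"/L0_*.BKP 2>/dev/null | head -1)
  [ -n "$newest" ] || return 1
  echo $(( ($(date +%s) - $(stat -c %Y "$newest")) / 3600 ))
}

if age=$(full_backup_age_hours); then
  if [ "$age" -le "$MAX_FULL_AGE_H" ]; then
    echo "OK: FULL backup age ${age}h"
  else
    echo "WARNING: FULL backup ${age}h old (limit ${MAX_FULL_AGE_H}h)"
  fi
else
  echo "WARNING: no FULL backup found in $BACKUP_DIR"
fi
```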
Check Logs
# Backup logs (on PRIMARY)
Get-Content D:\rman_backup\logs\backup_*.log -Tail 50
# Transfer logs (on PRIMARY) - UNIFIED script
Get-Content D:\rman_backup\logs\transfer_*.log -Tail 50
# Monitoring logs (on Proxmox)
tail -50 /var/log/oracle-dr/*.log
# Restore logs (on VM 109)
type D:\oracle\logs\restore_from_zero.log
🔐 Security & Access
SSH Keys Setup
PRIMARY (10.0.20.36) ──────► PROXMOX (10.0.20.202)
SSH Key
Port 22
LINUX WORKSTATION ─────────► PROXMOX (10.0.20.202)
SSH Key
Port 22
LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
SSH Key
Port 22122
Required Credentials
- PRIMARY: Administrator (for scheduled tasks)
- PROXMOX: root (for scripts and VM control)
- VM 109: romfast (user), SYSTEM (Oracle service)
📅 Maintenance Schedule
| Day | Time | Action | Duration | Impact |
|---|---|---|---|---|
| Daily | 02:30 | FULL Backup | 30 min | None |
| Daily | 09:00 | Monitor Backups | 1 min | None |
| Daily | 13:00 | CUMULATIVE Backup | 5 min | None |
| Daily | 18:00 | CUMULATIVE Backup | 5 min | None |
| Saturday | 06:00 | DR Test | 30 min | None |
🚨 Disaster Recovery Procedure
When PRIMARY is DOWN:
1. Confirm PRIMARY is unreachable
   ping 10.0.20.36   # should fail
2. Start the DR VM
   ssh root@10.0.20.202 "qm start 109"
3. Wait for boot (3 minutes)
4. Connect to the DR VM
   ssh -p 22122 romfast@10.0.20.37
5. Run the restore
   D:\oracle\scripts\rman_restore_from_zero.cmd
6. Verify the database
   sqlplus / as sysdba
   SELECT name, open_mode FROM v$database;
   -- Should show: ROA, READ WRITE
7. Update application connections
   - Change from: 10.0.20.36:1521/ROA
   - Change to: 10.0.20.37:1521/ROA
8. Monitor the DR system
   - The database is now production
   - Do NOT run cleanup!
   - Keep the VM running
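Where clients resolve the database through tnsnames.ora, the connection update above usually amounts to changing the HOST in the ROA alias. A sketch (the alias name and file location are assumptions — they may differ per client):

```text
# tnsnames.ora (sketch) - point the ROA alias at the DR host 10.0.20.37
ROA =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.20.37)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = ROA))
  )
```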
📝 Quick Reference Card
╔══════════════════════════════════════════════════════════════╗
║ DR QUICK REFERENCE ║
╠══════════════════════════════════════════════════════════════╣
║ PRIMARY DOWN? ║
║ ssh root@10.0.20.202 ║
║ qm start 109 ║
║ # Wait 3 min ║
║ ssh -p 22122 romfast@10.0.20.37 ║
║ D:\oracle\scripts\rman_restore_from_zero.cmd ║
╠══════════════════════════════════════════════════════════════╣
║ TEST DR? ║
║ ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"║
╠══════════════════════════════════════════════════════════════╣
║ CHECK BACKUPS? ║
║ ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"║
╠══════════════════════════════════════════════════════════════╣
║ SUPPORT: ║
║ Logs: /var/log/oracle-dr/ ║
║ Docs: proxmox/vm109-windows-dr/docs/ ║
╚══════════════════════════════════════════════════════════════╝
📂 Directory Structure
vm109-windows-dr/
├── README.md                            # This file
├── docs/
│   ├── PLAN_TESTARE_MONITORIZARE.md     # DR testing and monitoring plan
│   ├── PROXMOX_NOTIFICATIONS_README.md  # Proxmox notification setup
│   └── archive/                         # Previous plans and status reports
│       ├── DR_UPGRADE_TO_CUMULATIVE_PLAN.md
│       ├── DR_VM_MIGRATION_GUIDE.md
│       ├── DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md
│       └── DR_WINDOWS_VM_STATUS_2025-10-09.md
└── scripts/
    ├── oracle-backup-monitor-proxmox.sh # Daily monitoring (Proxmox)
    ├── weekly-dr-test-proxmox.sh        # Weekly DR test (Proxmox)
    ├── rman_backup.bat                  # RMAN full backup (Windows)
    ├── rman_backup_incremental.bat      # RMAN incremental backup (Windows)
    ├── transfer_backups.ps1             # Backup transfer (Windows)
    ├── rman_restore_from_zero.ps1       # Full restore (Windows DR)
    ├── cleanup_database.ps1             # Post-test cleanup (Windows DR)
    └── *.ps1                            # Other configuration scripts
Last Updated: 2026-01-27
Version: 2.2 - Unified transfer script (transfer_backups.ps1)
Status: ✅ Production Ready
📋 Changelog
v2.2 (Oct 31, 2025)
- ✨ Unified transfer script: replaced transfer_to_dr.ps1 and transfer_incremental.ps1 with a single transfer_backups.ps1
- 🎯 Smart duplicate detection: automatically skips files that already exist on DR
- ⚡ Flexible scheduling: can run after any backup type, or manually
- 🔧 Simplified maintenance: one script to maintain instead of two
v2.1 (Oct 11, 2025)
- Added restore test debugging guide
- Implemented new backup naming convention