# VM 109 - Oracle DR System (Windows Standby)

**Proxmox directory:** `proxmox/vm109-windows-dr/`
**VMID:** 109
**Role:** Disaster recovery for Oracle Database (RMAN backups from an external Windows server)

## ⚠️ Important — Topology after 2026-04-25

VM 109 lives on **pveelite** (10.0.20.202), co-located with the NFS storage for Oracle backups. Post-incident (04-20) configuration:

- **VM 109 in HA, group `ha-prefer-pveelite`** (pveelite=100, pvemini=50, pve1=10), `state=stopped`, `nofailback=1` — HA fails the VM over if pveelite dies but does not restart VM 109 automatically (it stays stopped; the DR script starts it weekly).
- **Defenses against a repeat of incident 04-20:**
  - `trap cleanup_vm EXIT` in the DR script (commit 8a0c557) with a `DR_VM_STARTED_BY_US` guard (commit 2e8cd9c) — stops VM 109 only if the script started it.
  - `vm109-watchdog.sh` cron on both pveelite and pvemini (cluster-aware) — force-stops VM 109 if it runs > 60 min outside the test window (Saturday 05:55-07:30). Debug exemption: `touch /var/run/vm109-debug.flag`.
  - Pre-flight check in the DR script: refuses `qm start 109` if the cluster is degraded or available memory < (VM 109 memory + 1 GB margin).
  - `max_restart=3, max_relocate=2` on all HA services — caps restart loops on OOM.

**Status check:**

```bash
ssh root@10.0.20.201 "ha-manager status | grep -E '109|201|108'"
ssh root@10.0.20.202 "qm status 109"   # must be stopped between tests
```

## 🔄 Storage Failover (pveelite → pvemini)

`/mnt/pve/oracle-backups` is a ZFS dataset replicated pveelite → pvemini every 15 min (`zfs-replicate-oracle-backups.sh`), plus a nightly mirror to pve1 backup-ssd (`nightly-backup-mirror.sh`). When pveelite is down:

1. **Automatic email** from `pveelite-down-alert.sh` (cron on pvemini, 5-minute threshold) with copy-paste failover instructions.
2. The operator runs on pvemini: `/opt/scripts/failover-dr-to-pvemini.sh` — promotes the ZFS replica (readonly → off), configures the NFS export, and patches the primary's Oracle scheduled-task IP via SSH.
3. When pveelite comes back: `/opt/scripts/failback-dr-to-pveelite.sh` — the reverse, with an incremental zfs send and config restore. Both scripts refuse to run if the other side is reachable (anti-split-brain).

---

# 🛡️ Oracle DR System - Complete Architecture

## 📊 System Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     PRODUCTION ENVIRONMENT                      │
├─────────────────────────────────────────────────────────────────┤
│  PRIMARY SERVER (10.0.20.36)                                    │
│  Windows Server + Oracle 19c                                    │
│  ┌──────────────────────────────┐                               │
│  │ Database: ROA                │                               │
│  │ Size: ~80 GB                 │                               │
│  │ Tables: 42,625               │                               │
│  └──────────────────────────────┘                               │
│              │                                                  │
│              ▼ Backups (Daily)                                  │
│  ┌──────────────────────────────┐                               │
│  │ 02:30 - FULL backup (6-7 GB) │                               │
│  │ 13:00 - CUMULATIVE (200 MB)  │                               │
│  │ 18:00 - CUMULATIVE (300 MB)  │                               │
│  └──────────────────────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
                     │
                     │ SSH Transfer (Port 22)
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                        DR ENVIRONMENT                           │
├─────────────────────────────────────────────────────────────────┤
│  PROXMOX HOST (10.0.20.202 - pveelite)                          │
│  ┌──────────────────────────────┐                               │
│  │ Backup Storage (NFS Server)  │◄─────── Monitoring Scripts    │
│  │ /mnt/pve/oracle-backups/     │         /opt/scripts/         │
│  │ └── ROA/autobackup/          │                               │
│  └──────────────────────────────┘                               │
│              │                                                  │
│              │ NFS Mount (F:\)                                  │
│              ▼                                                  │
│  ┌───────────────────────────────┐                              │
│  │ DR VM 109 (10.0.20.37)        │                              │
│  │ Windows Server + Oracle 19c   │                              │
│  │ Status: OFF (normally)        │                              │
│  │ Starts for: Tests or Disaster │                              │
│  └───────────────────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘
```

## 🎯 Quick Actions

### ⚡ Emergency DR Activation (Production Down!)

```bash
# 1. Start DR VM
ssh root@10.0.20.202 "qm start 109"

# 2. Connect to VM (wait 3 min for boot)
ssh -p 22122 romfast@10.0.20.37

# 3. Run restore (takes ~10-15 minutes)
D:\oracle\scripts\rman_restore_from_zero.cmd
# 4. Database is now RUNNING - Update app connections to 10.0.20.37
```

### 🔄 Failback DR → PRIMARY (when production is repaired)

The reverse procedure, for when the production server has been repaired or reinstalled and production must move back to `10.0.20.36`:

> ⚠️ **On PRIMARY, install Oracle 19c (NOT 21c) for an urgent failback.** The backups are 19.3. 21c can technically restore them, but it requires an additional dictionary upgrade (~30-60 min extra) — needless risk in a crisis window. Migration to 21c happens separately, after failback. Details in `FAILBACK_PROCEDURE.md`.

➡️ See **[docs/FAILBACK_PROCEDURE.md](docs/FAILBACK_PROCEDURE.md)** — end-to-end steps:

- Final backup on DR (with the DB in read-only / restricted mode)
- Restore onto the new PRIMARY with `scripts/rman_restore_to_primary.ps1`
- Switch connection strings + re-enable the RMAN scheduled tasks
- Stop VM 109, return to the normal state

### 🧪 Weekly Test (Every Saturday)

```bash
# Automatic at 06:00 via cron, or manual:
ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"

# What it does:
# ✓ Starts VM → Restores DB → Tests → Cleanup → Shutdown
# ✓ Sends email report with results
```

### 📊 Check Backup Health

```bash
# Manual check (runs daily at 09:00 automatically)
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"

# Output:
#   Status: OK
#   FULL backup age: 11 hours ✓
#   CUMULATIVE backup age: 2 hours ✓
#   Disk usage: 45% ✓
```

## 🗂️ Component Locations

### 📁 PRIMARY Server (10.0.20.36)

```
D:\rman_backup\
├── rman_backup_full.txt           # RMAN script for FULL backup
├── rman_backup_incremental.txt    # RMAN script for CUMULATIVE
└── transfer_backups.ps1           # UNIFIED: Transfer ALL backups to Proxmox

Scheduled Tasks:
├── 02:30 - Oracle RMAN Full Backup
├── 03:00 - Transfer backups to DR (transfer_backups.ps1)
├── 13:00 - Oracle RMAN Cumulative Backup
├── 14:45 - Transfer backups to DR (transfer_backups.ps1)
└── 18:00 - Oracle RMAN Cumulative Backup
```

### 📁 PROXMOX Host (10.0.20.202)

```
/opt/scripts/
├── oracle-backup-monitor-proxmox.sh    # Daily backup monitoring
├── weekly-dr-test-proxmox.sh           # Weekly DR test
└── PROXMOX_NOTIFICATIONS_README.md     # Documentation

/mnt/pve/oracle-backups/ROA/autobackup/
├── FULL_20251010_023001.BKP    # Latest FULL backup
├── INCR_20251010_130001.BKP    # CUMULATIVE 13:00
└── INCR_20251010_180001.BKP    # CUMULATIVE 18:00

Cron Jobs:
0 9 * * * /opt/scripts/oracle-backup-monitor-proxmox.sh
0 6 * * 6 /opt/scripts/weekly-dr-test-proxmox.sh
```

### 📁 DR VM 109 (10.0.20.37) - When Running

```
D:\oracle\scripts\
├── rman_restore_from_zero.cmd    # Main restore script ⭐
├── cleanup_database.cmd          # Cleanup after test
└── mount-nfs.bat                 # Mount F:\ at startup

F:\ (NFS mount from Proxmox)
└── ROA\autobackup\               # All backup files
```

## 🔄 How It Works

### Backup Flow (Daily)

```
PRIMARY                                PROXMOX
   │                                      │
   ├─02:30─FULL─Backup─────────────►      │
   │        (6-7 GB)                      │
   ├─03:00─Transfer ALL────────────►  Skip duplicates
   │        (transfer_backups.ps1)        │
   │                                      │
   ├─13:00─CUMULATIVE──────────────►      │
   │        (200 MB)                      │
   ├─14:45─Transfer ALL────────────►  Skip duplicates
   │        (transfer_backups.ps1)        │ (only new files)
   │                                      │
   └─18:00─CUMULATIVE──────────────►   Storage
            (300 MB)                      │
                                   ┌──────────┐
                                   │ Monitor  │ 09:00 Daily
                                   │ Check Age│ Alert if old
                                   └──────────┘
```

### Restore Process

```
Start VM → Mount F:\ → Copy Backups → RMAN Restore → Database OPEN
  2min       Auto         2min           8min           Ready!

Total Time: ~15 minutes
```

## 🔧 Manual Operations

### Test Individual Components

```bash
# 1. Test backup transfer (on PRIMARY)
powershell -ExecutionPolicy Bypass -File "D:\rman_backup\transfer_backups.ps1"

# 2. Test NFS mount (on VM 109)
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
dir F:\ROA\autobackup

# 3. Test notification system
ssh root@10.0.20.202 "touch -d '2 days ago' /mnt/pve/oracle-backups/ROA/autobackup/*FULL*.BKP"
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Should send WARNING notification
# 4. Test database restore (on VM 109)
D:\oracle\scripts\rman_restore_from_zero.cmd
```

### Force Actions

```bash
# Force backup now (on PRIMARY)
rman cmdfile=D:\rman_backup\rman_backup_incremental.txt

# Force cleanup VM (on VM 109)
D:\oracle\scripts\cleanup_database.cmd

# Force VM shutdown
ssh root@10.0.20.202 "qm stop 109"
```

## 🐛 Troubleshooting

### 🔍 Debugging Restore Tests

#### Check Backup Files on Proxmox (10.0.20.202)

```bash
# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lht /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"

# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"

# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"

# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"

# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"
ssh root@10.0.20.202 "du -sh /mnt/pve/oracle-backups/ROA/autobackup/"

# 6. Verify newest backup timestamp
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"
```

#### Verify Backup Files on DR VM (when running)

```powershell
# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup

# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime

# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count

# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)

# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1
# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"
```

#### Check DR Test Results

```bash
# 1. View latest DR test log
ssh root@10.0.20.202 "ls -lt /var/log/oracle-dr/dr_test_*.log | head -1 | awk '{print \$9}' | xargs cat | tail -100"

# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"

# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"

# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"

# 5. View RMAN errors (if any)
ssh root@10.0.20.202 "grep -i 'RMAN-\|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
```

#### Simulate Test Locally (on DR VM)

```powershell
# 1. Start Oracle service manually
Start-Service OracleServiceROA

# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT

# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode

# 4. Verify database opened correctly
sqlplus / as sysdba @D:\oracle\scripts\verify_db.sql

# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"
# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1
```

#### Common Restore Test Issues

| Issue | Check | Fix |
|-------|-------|-----|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: `Test-Path F:\` | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |

#### Verify Naming Convention is Active

```bash
# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active

# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
```

#### Manual Test Run with Verbose Output

```bash
# Run test with full output visible
ssh root@10.0.20.202
cd /opt/scripts
./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log

# Watch in real-time what's happening
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"
```

### ❌ Backup Monitor Not Sending Alerts

```bash
# 1. Check templates exist
ssh root@10.0.20.202 "ls /usr/share/pve-manager/templates/default/oracle-*"

# 2. Reinstall templates
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh --install"
# 3. Check Proxmox notifications work
ssh root@10.0.20.202 "pvesh create /nodes/\$(hostname)/apt/update"
# Should receive update notification
```

### ❌ F:\ Drive Not Accessible in VM

```powershell
# On VM 109:
# 1. Check NFS Client service
Get-Service | Where {$_.Name -like "*NFS*"}

# 2. Manual mount
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:

# 3. Check Proxmox NFS server (from a Linux host)
ssh root@10.0.20.202 "showmount -e localhost"
# Should show: /mnt/pve/oracle-backups 10.0.20.37
```

### ❌ Restore Fails

```cmd
# 1. Check backup files exist
dir F:\ROA\autobackup\*.BKP

# 2. Check Oracle service
sc query OracleServiceROA

# 3. Check PFILE exists
dir C:\Users\oracle\admin\ROA\pfile\initROA.ora

# 4. View restore log
type D:\oracle\logs\restore_from_zero.log
```

### ❌ VM Won't Start

```bash
# Check VM status
ssh root@10.0.20.202 "qm status 109"

# Check VM config
ssh root@10.0.20.202 "qm config 109 | grep -E 'memory|cores|bootdisk'"

# Force unlock if locked
ssh root@10.0.20.202 "qm unlock 109"

# Start with console
ssh root@10.0.20.202 "qm start 109 && qm terminal 109"
```

## 📈 Monitoring & Metrics

### Key Metrics

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| FULL Backup Age | < 24h | > 25h |
| CUMULATIVE Age | < 6h | > 7h |
| Backup Size | ~7 GB/day | > 10 GB |
| Restore Time | < 15 min | > 30 min |
| Disk Usage | < 80% | > 80% |

### Check Logs

```bash
# Backup logs (on PRIMARY)
Get-Content D:\rman_backup\logs\backup_*.log -Tail 50

# Transfer logs (on PRIMARY) - UNIFIED script
Get-Content D:\rman_backup\logs\transfer_*.log -Tail 50

# Monitoring logs (on Proxmox)
tail -50 /var/log/oracle-dr/*.log

# Restore logs (on VM 109)
type D:\oracle\logs\restore_from_zero.log
```

## 🔐 Security & Access

### SSH Keys Setup

```
PRIMARY (10.0.20.36) ──────► PROXMOX (10.0.20.202)
                             SSH Key, Port 22

LINUX WORKSTATION ─────────► PROXMOX (10.0.20.202)
                             SSH Key, Port 22

LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
                             SSH Key, Port 22122
```

### Required Credentials
- **PRIMARY**: Administrator (for scheduled tasks)
- **PROXMOX**: root (for scripts and VM control)
- **VM 109**: romfast (user), SYSTEM (Oracle service)

## 📅 Maintenance Schedule

| Day | Time | Action | Duration | Impact |
|-----|------|--------|----------|--------|
| Daily | 02:30 | FULL Backup | 30 min | None |
| Daily | 09:00 | Monitor Backups | 1 min | None |
| Daily | 13:00 | CUMULATIVE Backup | 5 min | None |
| Daily | 18:00 | CUMULATIVE Backup | 5 min | None |
| Saturday | 06:00 | DR Test | 30 min | None |

## 🚨 Disaster Recovery Procedure

### When PRIMARY is DOWN:

1. **Confirm PRIMARY is unreachable**
   ```bash
   ping 10.0.20.36
   # Should fail
   ```
2. **Start DR VM**
   ```bash
   ssh root@10.0.20.202 "qm start 109"
   ```
3. **Wait for boot (3 minutes)**
4. **Connect to DR VM**
   ```bash
   ssh -p 22122 romfast@10.0.20.37
   ```
5. **Run restore**
   ```cmd
   D:\oracle\scripts\rman_restore_from_zero.cmd
   ```
6. **Verify database**
   ```sql
   sqlplus / as sysdba
   SELECT name, open_mode FROM v$database;
   -- Should show: ROA, READ WRITE
   ```
7. **Update application connections**
   - Change from: 10.0.20.36:1521/ROA
   - Change to: 10.0.20.37:1521/ROA
8. **Monitor DR system**
   - Database is now production
   - Do NOT run cleanup!
   - Keep VM running

## 📝 Quick Reference Card

```
╔══════════════════════════════════════════════════════════════╗
║                      DR QUICK REFERENCE                      ║
╠══════════════════════════════════════════════════════════════╣
║ PRIMARY DOWN?                                                ║
║   ssh root@10.0.20.202                                       ║
║   qm start 109                                               ║
║   # Wait 3 min                                               ║
║   ssh -p 22122 romfast@10.0.20.37                            ║
║   D:\oracle\scripts\rman_restore_from_zero.cmd               ║
╠══════════════════════════════════════════════════════════════╣
║ TEST DR?                                                     ║
║   ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh" ║
╠══════════════════════════════════════════════════════════════╣
║ CHECK BACKUPS?                                               ║
║   ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh" ║
╠══════════════════════════════════════════════════════════════╣
║ SUPPORT:                                                     ║
║   Logs: /var/log/oracle-dr/                                  ║
║   Docs: proxmox/vm109-windows-dr/docs/                       ║
╚══════════════════════════════════════════════════════════════╝
```

---

## 📂 Directory Structure

```
vm109-windows-dr/
├── README.md                             # This file
├── docs/
│   ├── PLAN_TESTARE_MONITORIZARE.md      # DR testing and monitoring plan
│   ├── PROXMOX_NOTIFICATIONS_README.md   # Proxmox notification setup
│   ├── FAILBACK_PROCEDURE.md             # Failback DR → PRIMARY (reverse procedure)
│   └── archive/                          # Earlier plans and status reports
│       ├── DR_UPGRADE_TO_CUMULATIVE_PLAN.md
│       ├── DR_VM_MIGRATION_GUIDE.md
│       ├── DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md
│       └── DR_WINDOWS_VM_STATUS_2025-10-09.md
└── scripts/
    ├── oracle-backup-monitor-proxmox.sh  # Daily monitoring (Proxmox)
    ├── weekly-dr-test-proxmox.sh         # Weekly DR test (Proxmox)
    ├── rman_backup.bat                   # RMAN full backup (Windows)
    ├── rman_backup_incremental.bat       # RMAN incremental (Windows)
    ├── transfer_backups.ps1              # Backup transfer (Windows)
    ├── rman_restore_from_zero.ps1        # Restore PRIMARY → DR (disaster activation)
    ├── rman_restore_to_primary.ps1       # Restore DR → PRIMARY (failback)
    ├── cleanup_database.ps1              # Post-test cleanup (Windows DR)
    └── *.ps1                             # Other configuration scripts
```

---

**Last Updated:** 2026-01-27
**Version:** 2.2 - Unified transfer script (transfer_backups.ps1)
**Status:** ✅ Production Ready

## 📋 Changelog

### v2.2 (Oct 31, 2025)
- ✨ **Unified transfer script**: replaced `transfer_to_dr.ps1` and `transfer_incremental.ps1` with a single `transfer_backups.ps1`
- 🎯 **Smart duplicate detection**: automatically skips files that already exist on DR
- ⚡ **Flexible scheduling**: can run after any backup type or manually
- 🔧 **Simplified maintenance**: one script to maintain instead of two

### v2.1 (Oct 11, 2025)
- Added restore test debugging guide
- Implemented new backup naming convention