
VM 109 - Oracle DR System (Windows Standby)

Proxmox directory: proxmox/vm109-windows-dr/ | VMID: 109 | Role: Disaster Recovery for Oracle Database (RMAN backups from an external Windows server)

⚠️ Important: VM 109 is NOT in HA (since 2026-04-20)

After the 2026-04-20 incident (see ../cluster/incidents/2026-04-20-cluster-outage.md), VM 109 was removed from HA with ha-manager remove vm:109. Reasons:

  • VM 109 is a DR test VM, not a live service
  • The Saturday DR test script (scripts/weekly-dr-test-proxmox.sh) starts and stops VM 109 manually with qm start/stop
  • With HA active, a set -e bug in the script left VM 109 running for 2.5 days; when pvemini then crashed, HA relocated VM 109 to pveelite (16 GB), which triggered an OOM cascade

Effects:

  • VM 109 is no longer restarted automatically when a node crashes
  • VM 109 no longer moves off pvemini
  • VM 109 starts ONLY when the DR script is invoked, or manually with qm start 109
  • The DR script now has trap cleanup_vm EXIT, which guarantees qm stop 109 on every exit path
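
The EXIT-trap pattern mentioned in the last bullet can be sketched as follows. This is an illustration only, not the contents of weekly-dr-test-proxmox.sh: `run_dr_test`, the `QM` override, and passing the test body as a command are assumptions made so the sketch can be dry-run without a Proxmox node.

```shell
#!/usr/bin/env bash
# Sketch of the EXIT-trap pattern; NOT the actual weekly-dr-test-proxmox.sh.

VMID=109
QM="${QM:-qm}"   # assumption: set QM=echo to dry-run without Proxmox

cleanup_vm() {
    # Runs whenever the subshell below exits, including set -e aborts.
    "$QM" stop "$VMID" || true
}

run_dr_test() {  # run_dr_test <test command...>   (hypothetical wrapper)
    # The subshell keeps set -e and the trap local to a single test run.
    (
        set -e
        trap cleanup_vm EXIT
        "$QM" start "$VMID"
        "$@"     # the test body; if it fails, the trap still stops the VM
    )
}
```

A dry run such as `QM=echo run_dr_test false` prints the start command and then the stop command, even though the test body failed. One caveat worth remembering: bash ignores set -e inside a function that is called on the left side of `||` or in an `if` condition, so a wrapper like this is best invoked as a plain statement.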

Status check:

ssh root@10.0.20.201 "qm status 109"          # must report stopped
ssh root@10.0.20.201 "ha-manager status | grep 109 || echo 'not in HA'"
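
For unattended use, the two checks above could be folded into one pass/fail helper. The sketch below is a hypothetical function (`check_vm109` is not part of the repository); it assumes `qm status` prints a line of the form `status: stopped` and that an HA-managed VM shows up as `vm:109` in the `ha-manager status` output.

```shell
# Hypothetical helper: validate captured command output instead of eyeballing it.
check_vm109() {  # check_vm109 "<qm status output>" "<ha-manager status output>"
    case "$1" in
        *"status: stopped"*) ;;
        *) echo "VM 109 is not stopped"; return 1 ;;
    esac
    if printf '%s\n' "$2" | grep -q 'vm:109'; then
        echo "VM 109 is still an HA resource"
        return 1
    fi
    echo "OK"
}
```

Usage: `check_vm109 "$(ssh root@10.0.20.201 'qm status 109')" "$(ssh root@10.0.20.201 'ha-manager status')"`.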

🛡️ Oracle DR System - Complete Architecture

📊 System Overview

┌─────────────────────────────────────────────────────────────────┐
│                     PRODUCTION ENVIRONMENT                       │
├─────────────────────────────────────────────────────────────────┤
│  PRIMARY SERVER (10.0.20.36)                                    │
│  Windows Server + Oracle 19c                                     │
│  ┌──────────────────────────────┐                              │
│  │ Database: ROA                 │                              │
│  │ Size: ~80 GB                  │                              │
│  │ Tables: 42,625                │                              │
│  └──────────────────────────────┘                              │
│         │                                                        │
│         ▼ Backups (Daily)                                       │
│  ┌──────────────────────────────┐                              │
│  │ 02:30 - FULL backup (6-7 GB) │                              │
│  │ 13:00 - CUMULATIVE (200 MB)  │                              │
│  │ 18:00 - CUMULATIVE (300 MB)  │                              │
│  └──────────────────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘
                    │
                    │ SSH Transfer (Port 22)
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                        DR ENVIRONMENT                            │
├─────────────────────────────────────────────────────────────────┤
│  PROXMOX HOST (10.0.20.202 - pveelite)                         │
│  ┌──────────────────────────────┐                              │
│  │ Backup Storage (NFS Server)   │◄─────── Monitoring Scripts  │
│  │ /mnt/pve/oracle-backups/      │         /opt/scripts/       │
│  │ └── ROA/autobackup/           │                              │
│  └──────────────────────────────┘                              │
│         │                                                        │
│         │ NFS Mount (F:\)                                       │
│         ▼                                                        │
│  ┌──────────────────────────────┐                              │
│  │ DR VM 109 (10.0.20.37)       │                              │
│  │ Windows Server + Oracle 19c   │                              │
│  │ Status: OFF (normally)        │                              │
│  │ Starts for: Tests or Disaster │                              │
│  └──────────────────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘

🎯 Quick Actions

Emergency DR Activation (Production Down!)

# 1. Start DR VM
ssh root@10.0.20.202 "qm start 109"

# 2. Connect to VM (wait 3 min for boot)
ssh -p 22122 romfast@10.0.20.37

# 3. Run restore (takes ~10-15 minutes)
D:\oracle\scripts\rman_restore_from_zero.cmd

# 4. Database is now RUNNING - Update app connections to 10.0.20.37

🧪 Weekly Test (Every Saturday)

# Automatic at 06:00 via cron, or manual:
ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"

# What it does:
# ✓ Starts VM → Restores DB → Tests → Cleanup → Shutdown
# ✓ Sends email report with results

📊 Check Backup Health

# Manual check (runs daily at 09:00 automatically)
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"

# Output:
# Status: OK
# FULL backup age: 11 hours ✓
# CUMULATIVE backup age: 2 hours ✓
# Disk usage: 45% ✓

🗂️ Component Locations

📁 PRIMARY Server (10.0.20.36)

D:\rman_backup\
├── rman_backup_full.txt          # RMAN script for FULL backup
├── rman_backup_incremental.txt   # RMAN script for CUMULATIVE
└── transfer_backups.ps1          # UNIFIED: Transfer ALL backups to Proxmox

Scheduled Tasks:
├── 02:30 - Oracle RMAN Full Backup
├── 03:00 - Transfer backups to DR (transfer_backups.ps1)
├── 13:00 - Oracle RMAN Cumulative Backup
├── 14:45 - Transfer backups to DR (transfer_backups.ps1)
└── 18:00 - Oracle RMAN Cumulative Backup

📁 PROXMOX Host (10.0.20.202)

/opt/scripts/
├── oracle-backup-monitor-proxmox.sh  # Daily backup monitoring
├── weekly-dr-test-proxmox.sh         # Weekly DR test
└── PROXMOX_NOTIFICATIONS_README.md   # Documentation

/mnt/pve/oracle-backups/ROA/autobackup/
├── FULL_20251010_023001.BKP         # Latest FULL backup
├── INCR_20251010_130001.BKP         # CUMULATIVE 13:00
└── INCR_20251010_180001.BKP         # CUMULATIVE 18:00

Cron Jobs:
0 9 * * * /opt/scripts/oracle-backup-monitor-proxmox.sh
0 6 * * 6 /opt/scripts/weekly-dr-test-proxmox.sh

📁 DR VM 109 (10.0.20.37) - When Running

D:\oracle\scripts\
├── rman_restore_from_zero.cmd    # Main restore script ⭐
├── cleanup_database.cmd          # Cleanup after test
└── mount-nfs.bat                 # Mount F:\ at startup

F:\ (NFS mount from Proxmox)
└── ROA\autobackup\               # All backup files

🔄 How It Works

Backup Flow (Daily)

PRIMARY                         PROXMOX
   │                               │
   ├─02:30─FULL─Backup─────────────►
   │         (6-7 GB)               │
   ├─03:00─Transfer ALL────────────► Skip duplicates
   │      (transfer_backups.ps1)   │
   │                               │
   ├─13:00─CUMULATIVE──────────────►
   │         (200 MB)               │
   ├─14:45─Transfer ALL────────────► Skip duplicates
   │      (transfer_backups.ps1)   │ (only new files)
   │                               │
   └─18:00─CUMULATIVE──────────────►
             (300 MB)            Storage
                                    │
                             ┌──────────┐
                             │ Monitor  │ 09:00 Daily
                             │ Check Age│ Alert if old
                             └──────────┘
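
The "Skip duplicates" step in the flow above amounts to a name-based difference between the source and the DR directory. The real logic lives in transfer_backups.ps1 (PowerShell) and is not reproduced here; the shell sketch below only illustrates the idea, with a hypothetical function name and the assumption that two files with the same name are the same backup piece.

```shell
# Print the names of *.BKP files that exist in the source but not on the DR side.
# Sketch only: compares by file name, not by size or checksum.
files_to_transfer() {  # files_to_transfer <source_dir> <dr_dir>
    local src=$1 dst=$2 f name
    for f in "$src"/*.BKP; do
        [ -e "$f" ] || continue          # glob matched nothing
        name=${f##*/}
        [ -e "$dst/$name" ] || printf '%s\n' "$name"
    done
}
```

Because only missing names are listed, running the transfer twice in a row is harmless: the second run finds nothing new to copy.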

Restore Process

Start VM → Mount F:\ → Copy Backups → RMAN Restore → Database OPEN
  2min      Auto         2min           8min           Ready!

Total Time: ~15 minutes

🔧 Manual Operations

Test Individual Components

# 1. Test backup transfer (on PRIMARY)
powershell -ExecutionPolicy Bypass -File "D:\rman_backup\transfer_backups.ps1"

# 2. Test NFS mount (on VM 109)
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
dir F:\ROA\autobackup

# 3. Test notification system
ssh root@10.0.20.202 "touch -d '2 days ago' /mnt/pve/oracle-backups/ROA/autobackup/*FULL*.BKP"
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Should send WARNING notification

# 4. Test database restore (on VM 109)
D:\oracle\scripts\rman_restore_from_zero.cmd

Force Actions

# Force backup now (on PRIMARY)
rman cmdfile=D:\rman_backup\rman_backup_incremental.txt

# Force cleanup VM (on VM 109)
D:\oracle\scripts\cleanup_database.cmd

# Force VM shutdown
ssh root@10.0.20.202 "qm stop 109"

🐛 Troubleshooting

🔍 Debugging Restore Tests

Check Backup Files on Proxmox (10.0.20.202)

# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lht /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"

# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"

# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"

# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"

# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"
ssh root@10.0.20.202 "du -sh /mnt/pve/oracle-backups/ROA/autobackup/"

# 6. Verify newest backup timestamp
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"

Verify Backup Files on DR VM (when running)

# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup

# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime

# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count

# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)

# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1

# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"

Check DR Test Results

# 1. View latest DR test log
ssh root@10.0.20.202 "ls -t /var/log/oracle-dr/dr_test_*.log | head -1 | xargs tail -n 100"

# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"

# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"

# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"

# 5. View RMAN errors (if any)
ssh root@10.0.20.202 "grep -i 'RMAN-\|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
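
The greps above can be wrapped into a small summary helper when the result needs to be consumed by another script. This is a sketch under assumptions: the function names are hypothetical, and it assumes the log contains the literal words PASSED or FAILED (as the grep in step 2 suggests) and that any FAILED line means the run failed.

```shell
# Path of the most recently modified DR test log in a directory.
latest_dr_log() {  # latest_dr_log <log_dir>
    ls -t "$1"/dr_test_*.log 2>/dev/null | head -n 1
}

# Print PASSED, FAILED, or UNKNOWN for one DR test log.
dr_test_result() {  # dr_test_result <log_file>
    if grep -q 'FAILED' "$1"; then
        echo FAILED
    elif grep -q 'PASSED' "$1"; then
        echo PASSED
    else
        echo UNKNOWN
    fi
}
```

On the Proxmox host this could be called as `dr_test_result "$(latest_dr_log /var/log/oracle-dr)"`.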

Simulate Test Locally (on DR VM)

# 1. Start Oracle service manually
Start-Service OracleServiceROA

# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT

# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode

# 4. Verify database opened correctly
sqlplus "/ as sysdba" "@D:\oracle\scripts\verify_db.sql"

# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"

# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1

Common Restore Test Issues

| Issue | Check | Fix |
|-------|-------|-----|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: Test-Path F:\ | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |

Verify Naming Convention is Active

# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active

# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
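
To see how many files still use the old format, the prefixes from the check above can be classified explicitly. The sketch below is illustrative: the function names are hypothetical, and the prefix lists are taken directly from the grep patterns in this README (anything else, for example FULL_ or INCR_ names, is reported as unknown).

```shell
# Classify one backup file name by naming convention.
backup_format() {  # backup_format <file_name>
    case "$1" in
        L0_*|L1_*|ARC_*|SPFILE_*|CF_*) echo new ;;
        O1_MF_*)                       echo old ;;
        *)                             echo unknown ;;
    esac
}

# Read file names on stdin and print "new=<n> old=<n>".
count_formats() {
    local new=0 old=0 name
    while IFS= read -r name; do
        case "$(backup_format "$name")" in
            new) new=$((new + 1)) ;;
            old) old=$((old + 1)) ;;
        esac
    done
    echo "new=$new old=$old"
}
```

Usage: `ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/" | count_formats`.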

Manual Test Run with Verbose Output

# Run test with full output visible
ssh root@10.0.20.202
cd /opt/scripts
./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log

# Watch in real-time what's happening
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"

Backup Monitor Not Sending Alerts

# 1. Check templates exist
ssh root@10.0.20.202 "ls /usr/share/pve-manager/templates/default/oracle-*"

# 2. Reinstall templates
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh --install"

# 3. Check Proxmox notifications work
ssh root@10.0.20.202 'pvesh create /nodes/$(hostname)/apt/update'   # single quotes: $(hostname) must expand on the Proxmox node, not locally
# Should receive update notification

F:\ Drive Not Accessible in VM

# On VM 109:
# 1. Check NFS Client service
Get-Service | Where {$_.Name -like "*NFS*"}

# 2. Manual mount
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:

# 3. Check Proxmox NFS server
ssh root@10.0.20.202 "showmount -e localhost"
# Should show: /mnt/pve/oracle-backups 10.0.20.37

Restore Fails

# 1. Check backup files exist
dir F:\ROA\autobackup\*.BKP

# 2. Check Oracle service
sc query OracleServiceROA

# 3. Check PFILE exists
dir C:\Users\oracle\admin\ROA\pfile\initROA.ora

# 4. View restore log
type D:\oracle\logs\restore_from_zero.log

VM Won't Start

# Check VM status
ssh root@10.0.20.202 "qm status 109"

# Check VM config
ssh root@10.0.20.202 "qm config 109 | grep -E 'memory|cores|bootdisk'"

# Force unlock if locked
ssh root@10.0.20.202 "qm unlock 109"

# Start with console
ssh root@10.0.20.202 "qm start 109 && qm terminal 109"

📈 Monitoring & Metrics

Key Metrics

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| FULL Backup Age | < 24h | > 25h |
| CUMULATIVE Age | < 6h | > 7h |
| Backup Size | ~7 GB/day | > 10 GB |
| Restore Time | < 15 min | > 30 min |
| Disk Usage | < 80% | > 80% |
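
The age thresholds in the table reduce to a simple comparison of file age against a limit. The sketch below shows one way to express that; it is not the code of oracle-backup-monitor-proxmox.sh. The function names are hypothetical and `stat -c %Y` assumes GNU coreutils, as found on a Proxmox (Debian) host.

```shell
# Age in whole hours of the newest file among the given paths.
newest_age_hours() {  # newest_age_hours <file...>
    local newest now mtime
    newest=$(ls -t -- "$@" 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1        # no matching file at all
    now=$(date +%s)
    mtime=$(stat -c %Y -- "$newest")    # GNU stat: mtime as epoch seconds
    echo $(( (now - mtime) / 3600 ))
}

# Compare an age against an alert threshold from the table.
classify_age() {  # classify_age <age_hours> <alert_threshold_hours>
    if [ "$1" -gt "$2" ]; then echo WARNING; else echo OK; fi
}
```

For example, `classify_age "$(newest_age_hours /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP)" 25` applies the FULL backup threshold, and the same call with the L1_ pattern and 7 applies the CUMULATIVE one.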

Check Logs

# Backup logs (on PRIMARY)
Get-Content D:\rman_backup\logs\backup_*.log -Tail 50

# Transfer logs (on PRIMARY) - UNIFIED script
Get-Content D:\rman_backup\logs\transfer_*.log -Tail 50

# Monitoring logs (on Proxmox)
tail -50 /var/log/oracle-dr/*.log

# Restore logs (on VM 109)
type D:\oracle\logs\restore_from_zero.log

🔐 Security & Access

SSH Keys Setup

PRIMARY (10.0.20.36) ──────► PROXMOX (10.0.20.202)
                      SSH Key
                      Port 22

LINUX WORKSTATION ─────────► PROXMOX (10.0.20.202)
                      SSH Key
                      Port 22

LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
                      SSH Key
                      Port 22122

Required Credentials

  • PRIMARY: Administrator (for scheduled tasks)
  • PROXMOX: root (for scripts and VM control)
  • VM 109: romfast (user), SYSTEM (Oracle service)

📅 Maintenance Schedule

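| Day | Time | Action | Duration | Impact |
|-----|------|--------|----------|--------|
| Daily | 02:30 | FULL Backup | 30 min | None |
| Daily | 09:00 | Monitor Backups | 1 min | None |
| Daily | 13:00 | CUMULATIVE Backup | 5 min | None |
| Daily | 18:00 | CUMULATIVE Backup | 5 min | None |
| Saturday | 06:00 | DR Test | 30 min | None |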
🚨 Disaster Recovery Procedure

When PRIMARY is DOWN:

  1. Confirm PRIMARY is unreachable

    ping 10.0.20.36  # Should fail
    
  2. Start DR VM

    ssh root@10.0.20.202 "qm start 109"
    
  3. Wait for boot (3 minutes)

  4. Connect to DR VM

    ssh -p 22122 romfast@10.0.20.37
    
  5. Run restore

    D:\oracle\scripts\rman_restore_from_zero.cmd
    
  6. Verify database

    sqlplus / as sysdba
    SELECT name, open_mode FROM v$database;
    -- Should show: ROA, READ WRITE
    
  7. Update application connections

    • Change from: 10.0.20.36:1521/ROA
    • Change to: 10.0.20.37:1521/ROA

  8. Monitor DR system

    • Database is now production
    • Do NOT run cleanup!
    • Keep VM running
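
Step 3 relies on a fixed three-minute wait. In an emergency it can be more reliable to poll until the VM actually answers on SSH. The helper below is a generic sketch; `retry_until` is a hypothetical name, and the 300-second timeout in the usage line is an assumption, not a measured value.

```shell
# Re-run a command until it succeeds or the timeout (in seconds) is used up.
retry_until() {  # retry_until <timeout_s> <interval_s> <command...>
    local timeout=$1 interval=$2 waited=0
    shift 2
    until "$@"; do
        waited=$((waited + interval))
        [ "$waited" -ge "$timeout" ] && return 1   # give up
        sleep "$interval"
    done
}
```

Usage: `retry_until 300 10 ssh -p 22122 -o BatchMode=yes -o ConnectTimeout=5 romfast@10.0.20.37 exit`, then continue with step 5 once it returns successfully.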

📝 Quick Reference Card

╔══════════════════════════════════════════════════════════════╗
║                    DR QUICK REFERENCE                        ║
╠══════════════════════════════════════════════════════════════╣
║ PRIMARY DOWN?                                                ║
║ ssh root@10.0.20.202                                        ║
║ qm start 109                                                 ║
║ # Wait 3 min                                                 ║
║ ssh -p 22122 romfast@10.0.20.37                            ║
║ D:\oracle\scripts\rman_restore_from_zero.cmd                ║
╠══════════════════════════════════════════════════════════════╣
║ TEST DR?                                                     ║
║ ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"║
╠══════════════════════════════════════════════════════════════╣
║ CHECK BACKUPS?                                               ║
║ ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"║
╠══════════════════════════════════════════════════════════════╣
║ SUPPORT:                                                     ║
║ Logs: /var/log/oracle-dr/                                   ║
║ Docs: proxmox/vm109-windows-dr/docs/                        ║
╚══════════════════════════════════════════════════════════════╝

📂 Directory Structure

vm109-windows-dr/
├── README.md                           # This file
├── docs/
│   ├── PLAN_TESTARE_MONITORIZARE.md    # DR testing and monitoring plan
│   ├── PROXMOX_NOTIFICATIONS_README.md # Proxmox notification setup
│   └── archive/                        # Earlier plans and status reports
│       ├── DR_UPGRADE_TO_CUMULATIVE_PLAN.md
│       ├── DR_VM_MIGRATION_GUIDE.md
│       ├── DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md
│       └── DR_WINDOWS_VM_STATUS_2025-10-09.md
└── scripts/
    ├── oracle-backup-monitor-proxmox.sh  # Daily monitoring (Proxmox)
    ├── weekly-dr-test-proxmox.sh         # Weekly DR test (Proxmox)
    ├── rman_backup.bat                   # RMAN full backup (Windows)
    ├── rman_backup_incremental.bat       # RMAN incremental (Windows)
    ├── transfer_backups.ps1              # Backup transfer (Windows)
    ├── rman_restore_from_zero.ps1        # Full restore (Windows DR)
    ├── cleanup_database.ps1              # Cleanup after test (Windows DR)
    └── *.ps1                             # Other configuration scripts

Last Updated: 2026-01-27
Version: 2.2 - Unified transfer script (transfer_backups.ps1)
Status: Production Ready

📋 Changelog

v2.2 (Oct 31, 2025)

  • Unified transfer script: Replaced transfer_to_dr.ps1 and transfer_incremental.ps1 with single transfer_backups.ps1
  • 🎯 Smart duplicate detection: Automatically skips files that exist on DR
  • Flexible scheduling: Can run after any backup type or manually
  • 🔧 Simplified maintenance: One script to maintain instead of two

v2.1 (Oct 11, 2025)

  • Added restore test debugging guide
  • Implemented new backup naming convention