# 🛡️ Oracle DR System - Complete Architecture

## 📊 System Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     PRODUCTION ENVIRONMENT                      │
├─────────────────────────────────────────────────────────────────┤
│  PRIMARY SERVER (10.0.20.36)                                    │
│  Windows Server + Oracle 19c                                    │
│  ┌──────────────────────────────┐                               │
│  │ Database: ROA                │                               │
│  │ Size: ~80 GB                 │                               │
│  │ Tables: 42,625               │                               │
│  └──────────────────────────────┘                               │
│                 │                                               │
│                 ▼ Backups (Daily)                               │
│  ┌──────────────────────────────┐                               │
│  │ 02:30 - FULL backup (6-7 GB) │                               │
│  │ 13:00 - CUMULATIVE (200 MB)  │                               │
│  │ 18:00 - CUMULATIVE (300 MB)  │                               │
│  └──────────────────────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 │ SSH Transfer (Port 22)
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         DR ENVIRONMENT                          │
├─────────────────────────────────────────────────────────────────┤
│  PROXMOX HOST (10.0.20.202 - pveelite)                          │
│  ┌──────────────────────────────┐                               │
│  │ Backup Storage (NFS Server)  │◄─────── Monitoring Scripts    │
│  │ /mnt/pve/oracle-backups/     │         /opt/scripts/         │
│  │   └── ROA/autobackup/        │                               │
│  └──────────────────────────────┘                               │
│                 │                                               │
│                 │ NFS Mount (F:\)                               │
│                 ▼                                               │
│  ┌──────────────────────────────┐                               │
│  │ DR VM 109 (10.0.20.37)       │                               │
│  │ Windows Server + Oracle 19c  │                               │
│  │ Status: OFF (normally)       │                               │
│  │ Starts for: Tests or Disaster│                               │
│  └──────────────────────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
```

## 🎯 Quick Actions

### ⚡ Emergency DR Activation (Production Down!)

```bash
# 1. Start DR VM
ssh root@10.0.20.202 "qm start 109"

# 2. Connect to VM (wait 3 min for boot)
ssh -p 22122 romfast@10.0.20.37

# 3. Run restore on the VM (takes ~10-15 minutes)
D:\oracle\scripts\rman_restore_from_zero.cmd

# 4. Database is now RUNNING - Update app connections to 10.0.20.37
```

### 🧪 Weekly Test (Every Saturday)

```bash
# Automatic at 06:00 via cron, or manual:
ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"

# What it does:
# ✓ Starts VM → Restores DB → Tests → Cleanup → Shutdown
# ✓ Sends email report with results
```

### 📊 Check Backup Health

```bash
# Manual check (runs daily at 09:00 automatically)
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"

# Output:
# Status: OK
# FULL backup age: 11 hours ✓
# CUMULATIVE backup age: 2 hours ✓
# Disk usage: 45% ✓
```

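The age check behind this kind of output could be sketched as follows. This is a minimal illustration and not the contents of `oracle-backup-monitor-proxmox.sh`: the function names `file_age_hours` and `check_newest_backup` are hypothetical, and GNU `find` and `stat` are assumed to be available on the Proxmox host.

```bash
# Hypothetical helpers, not taken from the real monitor script.

# Print the age of a file in whole hours (GNU stat assumed).
file_age_hours() {
    local file="$1" now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 1
    echo $(( (now - mtime) / 3600 ))
}

# Report on the newest file matching a name pattern.
# $1 = directory, $2 = filename pattern, $3 = maximum allowed age in hours
check_newest_backup() {
    local dir="$1" pattern="$2" max_hours="$3" newest age
    # Sort candidates by modification time and keep the most recent one.
    newest=$(find "$dir" -maxdepth 1 -type f -name "$pattern" -printf '%T@ %p\n' \
             | sort -n | tail -1 | cut -d' ' -f2-)
    if [ -z "$newest" ]; then
        echo "WARNING: no file matching $pattern in $dir"
        return 1
    fi
    age=$(file_age_hours "$newest")
    if [ "$age" -gt "$max_hours" ]; then
        echo "WARNING: $(basename "$newest") is ${age}h old (limit ${max_hours}h)"
        return 1
    fi
    echo "OK: $(basename "$newest") is ${age}h old"
}
```

For example, `check_newest_backup /mnt/pve/oracle-backups/ROA/autobackup '*FULL*.BKP' 25` would apply a 25-hour limit to the newest FULL backup.
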
## 🗂️ Component Locations

### 📁 PRIMARY Server (10.0.20.36)
```
D:\rman_backup\
├── rman_backup_full.txt           # RMAN script for FULL backup
├── rman_backup_incremental.txt    # RMAN script for CUMULATIVE
├── transfer_to_dr.ps1             # Transfer FULL to Proxmox
└── transfer_incremental.ps1       # Transfer CUMULATIVE to Proxmox

Scheduled Tasks:
├── 02:30 - Oracle RMAN Full Backup
├── 13:00 - Oracle RMAN Cumulative Backup
└── 18:00 - Oracle RMAN Cumulative Backup
```

### 📁 PROXMOX Host (10.0.20.202)
```
/opt/scripts/
├── oracle-backup-monitor-proxmox.sh   # Daily backup monitoring
├── weekly-dr-test-proxmox.sh          # Weekly DR test
└── PROXMOX_NOTIFICATIONS_README.md    # Documentation

/mnt/pve/oracle-backups/ROA/autobackup/
├── FULL_20251010_023001.BKP           # Latest FULL backup
├── INCR_20251010_130001.BKP           # CUMULATIVE 13:00
└── INCR_20251010_180001.BKP           # CUMULATIVE 18:00

Cron Jobs:
0 9 * * * /opt/scripts/oracle-backup-monitor-proxmox.sh
0 6 * * 6 /opt/scripts/weekly-dr-test-proxmox.sh
```

### 📁 DR VM 109 (10.0.20.37) - When Running
```
D:\oracle\scripts\
├── rman_restore_from_zero.cmd     # Main restore script ⭐
├── cleanup_database.cmd           # Cleanup after test
└── mount-nfs.bat                  # Mount F:\ at startup

F:\ (NFS mount from Proxmox)
└── ROA\autobackup\                # All backup files
```

## 🔄 How It Works

### Backup Flow (Daily)
```
PRIMARY                    PROXMOX
   │                          │
   ├─02:30─FULL─Backup────────►
   │       (6-7 GB)           │
   │                          │
   ├─13:00─CUMULATIVE─────────►
   │       (200 MB)           │
   │                          │
   └─18:00─CUMULATIVE─────────►
           (300 MB)        Storage

                        ┌──────────┐
                        │ Monitor  │ 09:00 Daily
                        │ Check Age│ Alert if old
                        └──────────┘
```

### Restore Process
```
Start VM → Mount F:\ → Copy Backups → RMAN Restore → Database OPEN
  2min       Auto          2min           8min          Ready!

Total Time: ~15 minutes
```

## 🔧 Manual Operations

### Test Individual Components

```bash
# 1. Test backup transfer (on PRIMARY)
D:\rman_backup\transfer_incremental.ps1

# 2. Test NFS mount (on VM 109)
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
dir F:\ROA\autobackup

# 3. Test notification system
ssh root@10.0.20.202 "touch -d '2 days ago' /mnt/pve/oracle-backups/ROA/autobackup/*FULL*.BKP"
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Should send WARNING notification

# 4. Test database restore (on VM 109)
D:\oracle\scripts\rman_restore_from_zero.cmd
```

### Force Actions

```bash
# Force backup now (on PRIMARY)
rman cmdfile=D:\rman_backup\rman_backup_incremental.txt

# Force cleanup VM (on VM 109)
D:\oracle\scripts\cleanup_database.cmd

# Force VM shutdown
ssh root@10.0.20.202 "qm stop 109"
```

## 🐛 Troubleshooting

### 🔍 Debugging Restore Tests

#### Check Backup Files on Proxmox (10.0.20.202)

```bash
# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"

# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"

# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"

# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"

# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"

# 6. Verify Level 0 backup timestamps (new naming convention)
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"
```

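To see at a glance how many files of each type are present, the prefix matching in step 4 could be turned into a small summary. The sketch below is illustrative only: `backup_type` and `summarize_backups` are hypothetical names, and the type labels are my own shorthand for the prefixes grepped for above.

```bash
# Hypothetical helper: classify one backup filename by its prefix.
backup_type() {
    case "$1" in
        L0_*)     echo "level0" ;;
        L1_*)     echo "level1" ;;
        ARC_*)    echo "archivelog" ;;
        SPFILE_*) echo "spfile" ;;
        CF_*)     echo "controlfile" ;;
        O1_MF*)   echo "old-format" ;;
        *)        echo "unknown" ;;
    esac
}

# Print "<count> <type>" for the files in a directory, one type per line.
summarize_backups() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        backup_type "$(basename "$f")"
    done | sort | uniq -c | awk '{print $1, $2}'
}
```

Run on the Proxmox host, `summarize_backups /mnt/pve/oracle-backups/ROA/autobackup` would show whether any `level0` files exist before a restore test is attempted.
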
#### Verify Backup Files on DR VM (when running)

```powershell
# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup

# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime

# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count

# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)

# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1

# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"
```

#### Check DR Test Results

```bash
# 1. View the last 100 lines of the latest DR test log
ssh root@10.0.20.202 "ls -t /var/log/oracle-dr/dr_test_*.log | head -1 | xargs tail -100"

# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"

# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"

# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"

# 5. View RMAN errors (if any)
ssh root@10.0.20.202 "grep -iE 'RMAN-|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
```

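When chasing a false FAILED notification, it can help to reduce one log to its final verdict and a tally of error codes. The following is a rough sketch under stated assumptions: `dr_test_verdict` and `dr_test_errors` are hypothetical names, and the log is assumed to contain the literal words `PASSED` or `FAILED`, as the grep in step 2 implies.

```bash
# Hypothetical helper: print the last verdict recorded in one DR test log.
dr_test_verdict() {
    local log="$1" last
    # grep exits non-zero when nothing matches; treat that as "no verdict".
    last=$(grep -oE 'PASSED|FAILED' "$log" | tail -1) || true
    echo "${last:-UNKNOWN}"
}

# Hypothetical helper: tally RMAN-/ORA- error codes, most frequent first.
dr_test_errors() {
    grep -oE '(RMAN|ORA)-[0-9]+' "$1" | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}
```

A log whose verdict is `FAILED` but which lists no error codes would be a candidate for the false-FAILED case in the issues table.
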
#### Simulate Test Locally (on DR VM)

```powershell
# 1. Start Oracle service manually
Start-Service OracleServiceROA

# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT

# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode

# 4. Verify database opened correctly
sqlplus / as sysdba @D:\oracle\scripts\verify_db.sql

# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"

# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1
```

#### Common Restore Test Issues

| Issue | Check | Fix |
|-------|-------|-----|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: `Test-Path F:\` | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |

#### Verify Naming Convention is Active

```bash
# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active

# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
```

#### Manual Test Run with Verbose Output

```bash
# Run test with full output visible
ssh root@10.0.20.202
cd /opt/scripts
./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log

# Watch in real-time what's happening
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"
```

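Once the run has finished, the captured log can be checked for those key stages in one pass. This is a sketch only: `check_dr_stages` is a hypothetical name, and the marker strings are the ones listed above, so the exact wording in a real log is an assumption.

```bash
# Hypothetical helper: report which expected stage markers appear in a test log.
# Returns the number of missing markers (0 means every stage was seen).
check_dr_stages() {
    local log="$1" missing=0 marker
    for marker in \
        "TEST MODE: Selecting latest backup set" \
        "Total files selected:" \
        "RMAN restore completed successfully" \
        "OPEN_MODE: READ WRITE"
    do
        # -F matches the marker as a fixed string, not a regular expression.
        if grep -qF -- "$marker" "$log"; then
            echo "found:   $marker"
        else
            echo "missing: $marker"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

For the manual run above, `check_dr_stages /tmp/dr_test_manual.log` shows the first stage that never appeared, which narrows down where the test stopped.
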
### ❌ Backup Monitor Not Sending Alerts

```bash
# 1. Check templates exist
ssh root@10.0.20.202 "ls /usr/share/pve-manager/templates/default/oracle-*"

# 2. Reinstall templates
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh --install"

# 3. Check Proxmox notifications work
# Single quotes so that $(hostname) is expanded on the Proxmox host, not locally
ssh root@10.0.20.202 'pvesh create /nodes/$(hostname)/apt/update'
# Should receive update notification
```

### ❌ F:\ Drive Not Accessible in VM

```bash
# On VM 109:
# 1. Check NFS Client service
Get-Service | Where {$_.Name -like "*NFS*"}

# 2. Manual mount
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:

# 3. Check Proxmox NFS server
ssh root@10.0.20.202 "showmount -e localhost"
# Should show: /mnt/pve/oracle-backups 10.0.20.37
```

### ❌ Restore Fails

```bash
# 1. Check backup files exist
dir F:\ROA\autobackup\*.BKP

# 2. Check Oracle service
sc query OracleServiceROA

# 3. Check PFILE exists
dir C:\Users\oracle\admin\ROA\pfile\initROA.ora

# 4. View restore log
type D:\oracle\logs\restore_from_zero.log
```

### ❌ VM Won't Start

```bash
# Check VM status
ssh root@10.0.20.202 "qm status 109"

# Check VM config
ssh root@10.0.20.202 "qm config 109 | grep -E 'memory|cores|bootdisk'"

# Force unlock if locked
ssh root@10.0.20.202 "qm unlock 109"

# Start with console
ssh root@10.0.20.202 "qm start 109 && qm terminal 109"
```

## 📈 Monitoring & Metrics

### Key Metrics
| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| FULL Backup Age | < 24h | > 25h |
| CUMULATIVE Age | < 6h | > 7h |
| Backup Size | ~7 GB/day | > 10 GB |
| Restore Time | < 15 min | > 30 min |
| Disk Usage | < 80% | > 80% |

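The alert thresholds in this table could be encoded as a single lookup, which keeps them in one place if they are reused across scripts. This is a hedged sketch, not part of the existing monitoring: `metric_status` and the metric names passed to it are hypothetical, and values are assumed to be whole numbers in the table's units.

```bash
# Hypothetical helper: compare a measured value with an alert threshold from the table.
# $1 = metric name, $2 = measured value (whole hours, GB, minutes or percent)
metric_status() {
    local metric="$1" value="$2" limit
    case "$metric" in
        full_age_hours)       limit=25 ;;
        cumulative_age_hours) limit=7 ;;
        backup_size_gb)       limit=10 ;;
        restore_minutes)      limit=30 ;;
        disk_usage_percent)   limit=80 ;;
        *) echo "UNKNOWN metric: $metric"; return 2 ;;
    esac
    # Alert only when the value is strictly above the threshold.
    if [ "$value" -gt "$limit" ]; then
        echo "ALERT: $metric=$value exceeds $limit"
        return 1
    fi
    echo "OK: $metric=$value"
}
```

For instance, `metric_status disk_usage_percent 45` reports OK, while a FULL backup age of 26 hours would raise an alert.
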
### Check Logs

```bash
# Backup logs (on PRIMARY)
Get-Content D:\rman_backup\logs\backup_*.log -Tail 50

# Transfer logs (on PRIMARY)
Get-Content D:\rman_backup\logs\transfer_*.log -Tail 50

# Monitoring logs (on Proxmox)
tail -50 /var/log/oracle-dr/*.log

# Restore logs (on VM 109)
type D:\oracle\logs\restore_from_zero.log
```

## 🔐 Security & Access

### SSH Keys Setup
```
PRIMARY (10.0.20.36) ──────► PROXMOX (10.0.20.202)
                     SSH Key
                     Port 22

LINUX WORKSTATION ─────────► PROXMOX (10.0.20.202)
                     SSH Key
                     Port 22

LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
                     SSH Key
                     Port 22122
```

### Required Credentials
- **PRIMARY**: Administrator (for scheduled tasks)
- **PROXMOX**: root (for scripts and VM control)
- **VM 109**: romfast (user), SYSTEM (Oracle service)

## 📅 Maintenance Schedule

| Day | Time | Action | Duration | Impact |
|-----|------|--------|----------|--------|
| Daily | 02:30 | FULL Backup | 30 min | None |
| Daily | 09:00 | Monitor Backups | 1 min | None |
| Daily | 13:00 | CUMULATIVE Backup | 5 min | None |
| Daily | 18:00 | CUMULATIVE Backup | 5 min | None |
| Saturday | 06:00 | DR Test | 30 min | None |

## 🚨 Disaster Recovery Procedure

### When PRIMARY is DOWN:

1. **Confirm PRIMARY is unreachable**
   ```bash
   ping 10.0.20.36 # Should fail
   ```

2. **Start DR VM**
   ```bash
   ssh root@10.0.20.202 "qm start 109"
   ```

3. **Wait for boot (3 minutes)**

4. **Connect to DR VM**
   ```bash
   ssh -p 22122 romfast@10.0.20.37
   ```

5. **Run restore**
   ```cmd
   D:\oracle\scripts\rman_restore_from_zero.cmd
   ```

6. **Verify database**
   ```sql
   sqlplus / as sysdba
   SELECT name, open_mode FROM v$database;
   -- Should show: ROA, READ WRITE
   ```

7. **Update application connections**
   - Change from: 10.0.20.36:1521/ROA
   - Change to: 10.0.20.37:1521/ROA

8. **Monitor DR system**
   - Database is now production
   - Do NOT run cleanup!
   - Keep VM running

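Instead of waiting a fixed three minutes in step 3, the wait could poll until the VM's SSH port answers. The sketch below is one possible approach, not part of the existing scripts: `wait_for` and `port_open` are hypothetical names, and `port_open` assumes bash's `/dev/tcp` redirection and the coreutils `timeout` command are available on the workstation.

```bash
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# $1 = timeout in seconds, $2 = pause between attempts, remaining args = command
wait_for() {
    local timeout="$1" interval="$2" start
    shift 2
    start=$(date +%s)
    until "$@"; do
        if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
    done
}

# Hypothetical helper: succeed once host $1 accepts TCP connections on port $2.
# Each attempt is capped at 2 seconds.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
```

With these in place, `wait_for 300 10 port_open 10.0.20.37 22122 && ssh -p 22122 romfast@10.0.20.37` would connect as soon as the VM is reachable, or give up after five minutes.
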
## 📝 Quick Reference Card

```
╔══════════════════════════════════════════════════════════════╗
║                      DR QUICK REFERENCE                      ║
╠══════════════════════════════════════════════════════════════╣
║ PRIMARY DOWN?                                                ║
║   ssh root@10.0.20.202                                       ║
║   qm start 109                                               ║
║   # Wait 3 min                                               ║
║   ssh -p 22122 romfast@10.0.20.37                            ║
║   D:\oracle\scripts\rman_restore_from_zero.cmd               ║
╠══════════════════════════════════════════════════════════════╣
║ TEST DR?                                                     ║
║   ssh root@10.0.20.202 \                                     ║
║     "/opt/scripts/weekly-dr-test-proxmox.sh"                 ║
╠══════════════════════════════════════════════════════════════╣
║ CHECK BACKUPS?                                               ║
║   ssh root@10.0.20.202 \                                     ║
║     "/opt/scripts/oracle-backup-monitor-proxmox.sh"          ║
╠══════════════════════════════════════════════════════════════╣
║ SUPPORT:                                                     ║
║   Logs: /var/log/oracle-dr/                                  ║
║   Docs: /opt/scripts/PROXMOX_NOTIFICATIONS_README.md         ║
╚══════════════════════════════════════════════════════════════╝
```

---

**Last Updated:** October 11, 2025
**Version:** 2.1 - Added restore test debugging guide + naming convention
**Status:** ✅ Production Ready