Compare commits

2 Commits

Author SHA1 Message Date
Marius
249bf4d98a debugging 2025-10-11 19:18:37 +03:00
Marius
1b523c1624 Oracle DR: Add comprehensive restore test debugging guide to README
- Add section 'Debugging Restore Tests' with practical troubleshooting commands
- Check backup files on Proxmox: list, count, verify timestamps
- Verify backup files on DR VM: NFS mount, file counts, sizes
- Check DR test results: parse logs for PASSED/FAILED status
- Simulate test locally: manual restore steps for debugging
- Common issues table with checks and fixes
- Verify naming convention is active (L0_*, L1_* format)
- Manual test run with verbose output for real-time monitoring

Helps diagnose issues like:
- False FAILED notifications
- Missing datafiles
- RMAN-06023 errors
- Backup selection problems

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 19:03:08 +03:00

@@ -195,6 +195,131 @@ ssh root@10.0.20.202 "qm stop 109"
## 🐛 Troubleshooting
### 🔍 Debugging Restore Tests
#### Check Backup Files on Proxmox (10.0.20.202)
```bash
# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lht /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"
# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"
# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"
# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"
# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"
ssh root@10.0.20.202 "du -sh /mnt/pve/oracle-backups/ROA/autobackup/"
# 6. Show modification times of all Level 0 backups (new naming convention)
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"
```
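The timestamp checks above can be folded into a single freshness figure. The following is a minimal sketch, not part of the DR scripts: `newest_backup_age_hours` is a hypothetical helper name, and it assumes GNU `find` and `date` (as on a Debian-based Proxmox host).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the age, in whole hours, of the newest file
# matching a pattern in a backup directory. Prints "none" and returns 1
# when nothing matches. Assumes GNU find (-printf) and GNU date.
# Usage: newest_backup_age_hours <directory> [glob-pattern]
newest_backup_age_hours() {
    local dir="$1" pattern="${2:-*.BKP}" newest now
    # %T@ prints the modification time as seconds since the epoch
    newest=$(find "$dir" -maxdepth 1 -type f -name "$pattern" \
                 -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
    if [ -z "$newest" ]; then
        echo "none"
        return 1
    fi
    now=$(date +%s)
    # Drop the fractional part of the timestamp before doing integer math
    echo $(( (now - ${newest%.*}) / 3600 ))
}
```

On the Proxmox host this could be called as `newest_backup_age_hours /mnt/pve/oracle-backups/ROA/autobackup 'L0_*.BKP'`; a result well above 24 would suggest the daily backup did not run.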
#### Verify Backup Files on DR VM (when running)
```powershell
# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup
# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime
# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count
# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)
# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1
# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"
```
#### Check DR Test Results
```bash
# 1. View the last 100 lines of the latest DR test log
ssh root@10.0.20.202 "ls -t /var/log/oracle-dr/dr_test_*.log | head -1 | xargs -r tail -n 100"
# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"
# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"
# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"
# 5. View RMAN and ORA errors (if any)
ssh root@10.0.20.202 "grep -iE 'RMAN-|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
```
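To spot a false FAILED notification quickly, the status grep and the `OPEN_MODE` check can be combined. This is a rough sketch under stated assumptions: `classify_dr_log` is a hypothetical name, and it assumes the log contains the literal strings `PASSED`, `FAILED`, and `OPEN_MODE: READ WRITE`; the real log wording may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one DR test log.
# Assumes the literal markers PASSED, FAILED and "OPEN_MODE: READ WRITE"
# appear in the log; adjust the patterns if the real log differs.
# Usage: classify_dr_log <logfile>
classify_dr_log() {
    local log="$1" db_open=no
    if [ ! -r "$log" ]; then
        echo "UNREADABLE"
        return 2
    fi
    # The database counts as open if the verification output says so
    if grep -q 'OPEN_MODE: READ WRITE' "$log"; then
        db_open=yes
    fi
    if grep -q 'FAILED' "$log"; then
        if [ "$db_open" = yes ]; then
            # Test reported failure although the DB opened: likely a false alarm
            echo "FAILED_BUT_DB_OPEN"
        else
            echo "FAILED"
        fi
    elif grep -q 'PASSED' "$log"; then
        echo "PASSED"
    else
        echo "UNKNOWN"
    fi
}
```

A result of `FAILED_BUT_DB_OPEN` corresponds to the "Test reports FAILED but DB is open" case and is worth checking against the script version in use.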
#### Simulate Test Locally (on DR VM)
```powershell
# 1. Start Oracle service manually
Start-Service OracleServiceROA
# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT
# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode
# 4. Verify database opened correctly
# (quote the @script argument so PowerShell does not treat @ as its splatting operator)
sqlplus / as sysdba "@D:\oracle\scripts\verify_db.sql"
# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"
# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1
```
#### Common Restore Test Issues
| Issue | Check | Fix |
|-------|-------|-----|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: `Test-Path F:\` | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |
#### Verify Naming Convention is Active
```bash
# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active
# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
```
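A per-prefix breakdown can show which backup types already use the new names and how many old-format files remain. The sketch below is illustrative only: `count_backups_by_type` is a hypothetical helper, and the prefix list is taken from the grep patterns above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count files per naming prefix in a backup directory.
# Prefixes come from the grep patterns in this README; O1_MF is the old format.
# Usage: count_backups_by_type <directory>
count_backups_by_type() {
    local dir="$1" prefix count
    for prefix in L0_ L1_ ARC_ SPFILE_ CF_ O1_MF; do
        # find prints one line per matching file; wc -l counts the lines.
        # The arithmetic expansion strips any padding wc may add.
        count=$(( $(find "$dir" -maxdepth 1 -type f -name "${prefix}*" | wc -l) ))
        printf '%s %d\n' "$prefix" "$count"
    done
}
```

Run on the Proxmox host as `count_backups_by_type /mnt/pve/oracle-backups/ROA/autobackup`. Zero counts on every line except `O1_MF` would indicate the new convention is not yet active.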
#### Manual Test Run with Verbose Output
```bash
# Run the test on the Proxmox host with full output visible;
# the log copy is written to /tmp/dr_test_manual.log on that host
ssh root@10.0.20.202 "cd /opt/scripts && ./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log"
# Watch the output in real time
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"
```
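After the run, the saved log can be checked against the key stages listed above. This is a minimal sketch: `check_dr_stages` is a hypothetical helper, and the stage strings are copied from this guide, so they are assumed to match the script's actual output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which key stages appear in a DR test log.
# Returns 0 only if every stage was found.
# Usage: check_dr_stages <logfile>
check_dr_stages() {
    local log="$1" stage missing=0
    # Stage markers as listed in this guide (assumed to match the real output)
    local stages=(
        "TEST MODE: Selecting latest backup set"
        "Total files selected"
        "RMAN restore completed successfully"
        "OPEN_MODE: READ WRITE"
    )
    for stage in "${stages[@]}"; do
        # -F matches the marker as a fixed string, not a regex
        if grep -qF -- "$stage" "$log"; then
            echo "[ok]      $stage"
        else
            echo "[missing] $stage"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For example, `check_dr_stages /tmp/dr_test_manual.log` on the Proxmox host; the first `[missing]` line points at the stage where the run stopped.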
### ❌ Backup Monitor Not Sending Alerts
```bash
@@ -384,6 +509,6 @@ LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
---
-**Last Updated:** October 10, 2025
-**Version:** 2.0 - Complete DR System with Proxmox Integration
+**Last Updated:** October 11, 2025
+**Version:** 2.1 - Added restore test debugging guide + naming convention
**Status:** ✅ Production Ready