Oracle DR: Phase 6.5 - Complete cleanup and restore scripts (TESTING)

Major improvements to DR restore workflow:

**New Scripts:**
- cleanup_database.cmd: Complete cleanup using oradim + registry deletion
- rman_restore_from_zero.cmd: Copy backups to recovery_area + restore

**Key Solutions Implemented:**
1. RMAN AUTOBACKUP limitation: Must have backups in recovery_area
   - Solution: Copy ALL backups from F:\ (NFS) to C:\...\recovery_area
   - Performance: 6.7 GB copied in ~2 minutes

2. Oracle service persistence issue: Service remains after sc delete
   - Solution: Use oradim -delete -sid ROA (proper Oracle cleanup)
   - Bonus: Delete registry keys to ensure clean state

**Current Status:**
- Cleanup: ✅ TESTED (oradim works perfectly)
- Backup copy: ✅ TESTED (6.7 GB in 2 min)
- RMAN restore: 🟡 IN PROGRESS (expected completion 03:35-03:40)

**Updated:**
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Progress tracking + solutions documented
- rman_restore_final.cmd: Use F:\ mount point

🤖 Generated with Claude Code
Marius
2025-10-10 03:29:25 +03:00
parent 8682e0ee04
commit cbad9ee779
4 changed files with 630 additions and 49 deletions


@@ -1,8 +1,8 @@
# Oracle DR - Upgrade to Cumulative Incremental Backup Strategy
**Generated:** 2025-10-09
**Last Updated:** 2025-10-10 00:45
**Status:** 🟡 IN PROGRESS - Phases 1-3-5 COMPLETED
**Last Updated:** 2025-10-10 03:25
**Status:** 🟡 FINAL TESTING IN PROGRESS - RMAN restore running
**Objective:** Implement cumulative incremental backups with Proxmox host storage for optimal RPO/RTO
**Target RPO:** 3-4 hours (vs current 24 hours)
**Target RTO:** 12-15 minutes (unchanged)
@@ -11,7 +11,9 @@
## ✅ IMPLEMENTATION STATUS
### Completed (2025-10-09 + 2025-10-10)
### Completed (2025-10-09 + 2025-10-10 Sessions)
#### Session 1 (2025-10-09 evening)
- ✅ **Phase 1:** Proxmox host storage configured (`/mnt/pve/oracle-backups/ROA/autobackup`)
- ✅ **Phase 2:** RMAN script already has `CUMULATIVE` keyword
- ✅ **Phase 3:** Transfer scripts updated to send to Proxmox (10.0.20.202:22, root)
@@ -19,6 +21,15 @@
  - Changed from VM 109 (10.0.20.37:22122) to Proxmox host
  - Converted Windows PowerShell commands to Linux bash
- ✅ **VM 109 cleanup:** Deleted temporary files, old backups (~6.4 GB freed)
- ✅ **SSH Key Setup:** SSH key copied from PRIMARY to Proxmox
  - Existing key: `C:\Windows\System32\config\systemprofile\.ssh\id_rsa`
  - Copied to: Proxmox `/root/.ssh/authorized_keys`
  - SSH passwordless access working ✅
- ✅ **Phase 4:** Scheduled tasks modified on PRIMARY
  - Task 1: 02:30 FULL backup (unchanged)
  - Task 2: 13:00 CUMULATIVE backup (modified from 14:00)
  - Task 3: 18:00 CUMULATIVE backup (created)
  - All tasks now use Proxmox host as destination
- ✅ **Phase 5:** NFS mount point configured on VM 109 → **F:\ drive**
  - NFS server installed on Proxmox: `nfs-kernel-server`
  - NFS export configured: `/mnt/pve/oracle-backups → 10.0.20.37 (rw,no_root_squash)`
@@ -28,13 +39,60 @@
  - Permissions set to 777 on Proxmox directory
  - **Status:** F:\ mounts automatically at Windows startup ✅
#### Session 2 (2025-10-10 late night - MAJOR PROGRESS)
- ✅ **Phase 6:** Restore scripts updated to use F:\ mount
  - `rman_restore_final.cmd` modified to read backups from F:\ROA\autobackup
  - Scripts verify F:\ mount is accessible before starting restore
  - **FIXED:** Control file restore now uses `RESTORE CONTROLFILE FROM AUTOBACKUP`
  - All RMAN catalog commands point to F:\ mount
- ✅ **Phase 6.5:** Database cleanup strategy implemented (CRITICAL FEATURE)
  - **cleanup_database.cmd** created:
    - Deletes Oracle service completely
    - Deletes ALL database files (datafiles, control files, redo logs)
    - Deletes local FRA (backups safe on F:\)
    - Does NOT recreate service (service created during restore)
    - Leaves VM in completely clean state
  - **rman_restore_from_zero.cmd** created:
    - Step 1: Calls cleanup_database.cmd (clean state)
    - Step 2.1: Creates Oracle service from PFILE
    - Step 2.2: STARTUP NOMOUNT
    - Step 2.3: Generates RMAN restore script
    - Step 2.4: Runs RMAN restore (control file → mount → catalog → restore → recover → open)
    - Step 3: Verifies database
  - **Workflow documented:**
    - **Weekly test:** restore → test → cleanup → shutdown
    - **Real disaster:** restore → keep running (NO cleanup!)
  - Saves ~8 GB disk space after each test
  - Ensures repeatable, clean DR tests from zero
- ✅ **Backup transfer tested:**
  - Manual backup executed on PRIMARY
  - Transfer script successfully copied 6.7 GB to Proxmox
  - Backups verified accessible on F:\ in VM 109
- ✅ **Cleanup script tested:**
  - Successfully deletes all database files
  - Successfully removes Oracle service
  - VM confirmed in clean state (no service, no DB files)
- 🟡 **Restore script final test IN PROGRESS:**
  - **Key challenges solved:**
    - Issue 1: RMAN AUTOBACKUP doesn't work with backups on F:\ (NFS mount)
      - Solution: Copy ALL backups from F:\ to C:\Users\oracle\recovery_area before restore
    - Issue 2: Oracle service persists in registry after `sc delete`
      - Solution: Use `oradim -delete -sid ROA` + delete registry keys manually
  - **Current test status:**
    - Cleanup: ✅ PASSED (oradim delete works perfectly)
    - Service creation: ✅ PASSED
    - NOMOUNT: ✅ PASSED
    - Backup copy F:\ → recovery_area: ✅ PASSED (6.7 GB in ~2 min)
    - RMAN restore: ⏳ RUNNING NOW (expected ~10-15 min)
    - Expected completion: 2025-10-10 03:35-03:40
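
The workaround for Issue 1 (copying the backups from the NFS mount into the local recovery area before RMAN runs) could look roughly like the sketch below. It is not the content of `rman_restore_from_zero.cmd`: the destination subfolder `ROA\autobackup` and the robocopy retry settings are assumptions chosen for illustration.

```cmd
@echo off
setlocal
REM Sketch only: source and destination subfolders are assumptions.
set "SRC=F:\ROA\autobackup"
set "DST=C:\Users\oracle\recovery_area\ROA\autobackup"

REM Abort early if the NFS mount is not available.
if not exist "%SRC%\" (
    echo ERROR: %SRC% is not accessible, check the F:\ NFS mount.
    exit /b 1
)

REM /E copies subfolders; /R and /W limit retries on a flaky NFS link.
robocopy "%SRC%" "%DST%" /E /R:2 /W:5 /NP

REM robocopy exit codes 0-7 mean success; 8 and above mean at least one failure.
if %ERRORLEVEL% GEQ 8 (
    echo ERROR: backup copy failed with robocopy exit code %ERRORLEVEL%.
    exit /b 1
)
echo Backup copy finished.
endlocal
```

Testing the robocopy exit code against 8 rather than 0 matters here, because robocopy returns non-zero values such as 1 ("files copied") on a fully successful run.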
### Pending (Next Session)
- **SSH Key Setup:** Run `copy_existing_key_to_proxmox.ps1` on PRIMARY as Administrator
  - Existing key: `C:\Windows\System32\config\systemprofile\.ssh\id_rsa`
  - Copy to: Proxmox `/root/.ssh/authorized_keys`
- **Phase 4:** Modify scheduled tasks on PRIMARY (13:00 + 18:00)
- **Phase 6:** Update restore script to use F:\ mount
- **Phase 7:** Test FULL + CUMULATIVE backup and restore
- **Phase 7:** Final end-to-end test (15-20 minutes)
  - Run `rman_restore_from_zero.cmd` with fixed control file restore
  - Verify database opens successfully
  - Test cleanup after successful restore
  - **Note:** Backup files already transferred to F:\ (6.7 GB)
  - **Issue found and fixed:** Control file restore now uses `RESTORE CONTROLFILE FROM AUTOBACKUP`
### Files Modified
```
@@ -42,11 +100,17 @@ oracle/standby-server-scripts/
├── transfer_incremental.ps1 [MODIFIED] → Proxmox host
├── transfer_to_dr.ps1 [MODIFIED] → Proxmox host
├── rman_backup_incremental.txt [ALREADY OK] → Has CUMULATIVE
└── copy_existing_key_to_proxmox.ps1 [NEW] → Setup script for SSH key
├── copy_existing_key_to_proxmox.ps1 [NEW] → Setup script for SSH key
├── rman_restore_final.cmd [MODIFIED] → Use F:\ mount
├── cleanup_database.cmd [NEW] → Complete cleanup (oradim + registry)
└── rman_restore_from_zero.cmd [NEW] → Copy backups + restore from recovery_area
VM 109 (Windows):
├── C:\Scripts\mount-nfs.ps1 [NEW] → PowerShell script for NFS mount
└── Scheduled Task: "Mount NFS F" [NEW] → Auto-mount at startup
├── Scheduled Task: "Mount NFS F" [NEW] → Auto-mount at startup
├── D:\oracle\scripts\rman_restore_final.cmd [MODIFIED] → Use F:\ mount
├── D:\oracle\scripts\cleanup_database.cmd [NEW] → Cleanup script
└── D:\oracle\scripts\rman_restore_from_zero.cmd [NEW] → Full restore from zero
Proxmox (pveelite):
├── /etc/exports [MODIFIED] → NFS export configuration
@@ -617,6 +681,192 @@ exit /b 0
---
### PHASE 6.5: Database Cleanup Strategy - Restore from Zero (NEW)
**Objective:** Keep DR VM clean by restoring from zero each time (no old database files, no Oracle services)
**Why this approach?**
- **Repeatable testing:** Each test starts from a known clean state
- **No leftovers:** No old control files, redo logs, or datafiles
- **True DR test:** Simulates a real disaster scenario (no database, only Oracle software)
- **No manual cleanup:** Automated cleanup before and after each test
- **Save disk space:** Delete 8+ GB of database files after each test
#### 6.5.1 Cleanup Steps (BEFORE restore)
**What to delete:**
```cmd
REM 1. Stop the Oracle service and remove it with oradim
REM    (sc delete alone left the service behind in the registry)
sc stop OracleServiceROA 2>nul
oradim -delete -sid ROA 2>nul
REM 2. Delete all database files (datafiles, control files, redo logs)
del /Q C:\Users\oracle\oradata\ROA\*.dbf 2>nul
del /Q C:\Users\oracle\oradata\ROA\*.ctl 2>nul
del /Q C:\Users\oracle\oradata\ROA\*.log 2>nul
REM 3. Delete local FRA (backups are on F:\ now, safe to delete)
rmdir /S /Q C:\Users\oracle\recovery_area\ROA 2>nul
mkdir C:\Users\oracle\recovery_area\ROA
REM 4. Delete old trace files (optional, saves space)
del /Q C:\Users\oracle\diag\rdbms\roa\ROA\trace\*.* 2>nul
REM 5. Recreate Oracle service from pfile
oradim -new -sid ROA -startmode manual -pfile C:\Users\oracle\admin\ROA\pfile\initROA.ora
```
**Result:** Clean VM with:
- ✅ Oracle software installed
- ✅ PFILE exists: `C:\Users\oracle\admin\ROA\pfile\initROA.ora`
- ✅ Oracle service created: `OracleServiceROA`
- ❌ No database files (will be restored)
- ❌ No control files (will be restored)
- ❌ No datafiles (will be restored)
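
The expected result listed above can be checked before starting a restore. The following is a minimal sketch, assuming the paths and service name used in this section; it is a hypothetical helper, not one of the scripts added in this commit.

```cmd
@echo off
REM Sketch only: verifies the clean state described in 6.5.1.
set "CLEAN=1"

REM The service should exist again after step 5 (sc query fails if it does not).
sc query OracleServiceROA >nul 2>&1
if errorlevel 1 (
    echo WARN: service OracleServiceROA is missing.
    set "CLEAN=0"
)

REM The PFILE is needed to start the instance in NOMOUNT.
if not exist "C:\Users\oracle\admin\ROA\pfile\initROA.ora" (
    echo WARN: PFILE initROA.ora not found.
    set "CLEAN=0"
)

REM No datafiles or control files should be left from a previous test.
if exist "C:\Users\oracle\oradata\ROA\*.dbf" (
    echo WARN: datafiles still present.
    set "CLEAN=0"
)
if exist "C:\Users\oracle\oradata\ROA\*.ctl" (
    echo WARN: control files still present.
    set "CLEAN=0"
)

if "%CLEAN%"=="1" (
    echo VM is in the expected clean state.
    exit /b 0
)
echo VM is NOT in the expected clean state.
exit /b 1
```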
#### 6.5.2 Cleanup Steps (AFTER successful restore test)
**Purpose:** Leave VM clean for next test, conserve disk space
```cmd
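REM After verifying database is working:
REM 1. Shutdown database (cmd has no here-documents, so pipe the commands into sqlplus)
set ORACLE_SID=ROA
(
echo SHUTDOWN ABORT;
echo EXIT;
) | sqlplus -S / as sysdba
REM 2. Delete Oracle service (oradim removes it cleanly, unlike sc delete)
sc stop OracleServiceROA
oradim -delete -sid ROA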
REM 3. Delete all database files
del /Q C:\Users\oracle\oradata\ROA\*.dbf
del /Q C:\Users\oracle\oradata\ROA\*.ctl
del /Q C:\Users\oracle\oradata\ROA\*.log
rmdir /S /Q C:\Users\oracle\recovery_area\ROA
REM 4. VM is now clean and ready for next test
echo Database cleanup complete - VM ready for next test
```
#### 6.5.3 Modified restore workflow
**OLD workflow (problematic):**
```
1. Start VM → database files exist from previous test
2. Shutdown existing database
3. Delete control files manually
4. Restore → may fail if old files interfere
5. Manually cleanup after test
```
**NEW workflow (clean and repeatable):**
```
1. Start VM → clean state (no database, only software)
2. Cleanup script: delete any leftover files + recreate service
3. Restore from F:\ backups → fresh database
4. Verify and test
5. Cleanup script: delete database files
6. Shutdown VM → ready for next test
```
#### 6.5.4 Scripts created and their usage
**File 1:** `D:\oracle\scripts\cleanup_database.cmd`
- **Purpose:** Standalone cleanup script
- **What it does:**
  - Stops and deletes Oracle service
  - Deletes all database files (datafiles, control files, redo logs)
  - Deletes local FRA (backups are on F:\, safe to delete)
  - Recreates Oracle service from PFILE
- **When to use:**
  - Before weekly test restore (to start from clean state)
  - After weekly test restore (to clean up and save disk space)
  - Manual cleanup when needed
- **Never use:** In real disaster scenario (you want to keep the database!)
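
The service removal step can be sketched as follows. The plan states that `sc delete` left the service behind and that registry keys are deleted in addition to `oradim -delete`, but it does not say which keys; the standard Windows service key used below is an assumption. Run it from an elevated prompt.

```cmd
@echo off
REM Sketch only: the registry path is the standard Windows service key,
REM assumed here because the plan does not list the exact keys it deletes.
set "SVC_KEY=HKLM\SYSTEM\CurrentControlSet\Services\OracleServiceROA"

REM Stop the service if it is running, then let oradim remove it.
sc stop OracleServiceROA >nul 2>&1
oradim -delete -sid ROA

REM reg query returns 0 when the key still exists.
reg query "%SVC_KEY%" >nul 2>&1
if not errorlevel 1 (
    echo Service key still present, deleting it.
    reg delete "%SVC_KEY%" /f
)
```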
**File 2:** `D:\oracle\scripts\rman_restore_from_zero.cmd`
- **Purpose:** Full restore workflow (cleanup BEFORE restore only)
- **What it does:**
  - Calls cleanup_database.cmd at START
  - Verifies F:\ mount is accessible
  - Restores database from F:\ backups
  - Opens database with RESETLOGS
  - Verifies database is working
  - Does NOT cleanup after restore (database remains running)
- **When to use:**
  - Weekly test restore (then manually run cleanup_database.cmd after testing)
  - **Real disaster scenario** (database remains running for production use)
- **Result:** Database is OPEN and ready to use
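
As an illustration of the "generate RMAN script, then run it" steps, the sketch below writes the RMAN commands in the documented order (control file → mount → catalog → restore → recover → open) and runs them. The file names under `%TEMP%` and the catalog path are assumptions, and the instance is assumed to be in NOMOUNT already; this is not the actual content of `rman_restore_from_zero.cmd`.

```cmd
@echo off
setlocal
REM Sketch only: file names and the catalog path are assumptions.
set "ORACLE_SID=ROA"
set "RMAN_SCRIPT=%TEMP%\restore_roa.rman"
set "RMAN_LOG=%TEMP%\restore_roa.log"

REM Write the RMAN commands in the documented order.
(
    echo RESTORE CONTROLFILE FROM AUTOBACKUP;
    echo ALTER DATABASE MOUNT;
    echo CATALOG START WITH 'C:\Users\oracle\recovery_area\ROA' NOPROMPT;
    echo RESTORE DATABASE;
    echo RECOVER DATABASE;
    echo ALTER DATABASE OPEN RESETLOGS;
) > "%RMAN_SCRIPT%"

REM Run the generated script; rman returns a non-zero exit code on error.
rman target / cmdfile="%RMAN_SCRIPT%" log="%RMAN_LOG%"
if errorlevel 1 (
    echo ERROR: RMAN reported a failure, see %RMAN_LOG%.
    exit /b 1
)
echo Restore finished, see %RMAN_LOG%.
endlocal
```

One caveat with this simple form: `RECOVER DATABASE` can stop with an error once it runs out of archived logs, in which case the `OPEN RESETLOGS` line is never reached. A real script may need to bound the recovery or handle that case separately.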
**File 3:** `D:\oracle\scripts\rman_restore_final.cmd` (legacy)
- **Purpose:** Restore without cleanup (assumes database files may exist)
- **When to use:** Only if rman_restore_from_zero.cmd fails
- **Recommendation:** Use rman_restore_from_zero.cmd instead
#### 6.5.5 Usage workflows
**A. Weekly Test Restore (Saturday morning):**
```cmd
REM 1. Start VM and verify F:\ mount
dir F:\ROA\autobackup
REM 2. Run restore (includes cleanup before restore)
D:\oracle\scripts\rman_restore_from_zero.cmd
REM 3. Verify database is working
sqlplus / as sysdba
SQL> SELECT * FROM V$DATABASE;
REM 4. Test application connectivity (optional)
REM 5. Cleanup after test to free disk space
D:\oracle\scripts\cleanup_database.cmd
REM 6. Shutdown VM
shutdown /s /t 60
```
**B. Real Disaster Scenario (production restore):**
```cmd
REM 1. Start VM and verify F:\ mount
dir F:\ROA\autobackup
REM 2. Run restore (includes cleanup before restore)
D:\oracle\scripts\rman_restore_from_zero.cmd
REM 3. Database is now OPEN and ready for production use
REM DO NOT run cleanup_database.cmd after this!
REM 4. Update application connection strings to point to DR VM
REM 5. Keep VM running for production use
```
**C. Manual cleanup (when VM gets full):**
```cmd
REM Run cleanup to free ~8 GB disk space
D:\oracle\scripts\cleanup_database.cmd
```
#### 6.5.6 Important notes
⚠️ **CRITICAL: cleanup_database.cmd deletes the entire database!**
- Use it BEFORE weekly test restore (to start clean)
- Use it AFTER weekly test restore (to free disk space)
- **NEVER use it after a real disaster restore!** (you need the database running!)
**For weekly tests:**
- Run: `rman_restore_from_zero.cmd` → test → `cleanup_database.cmd` → shutdown VM
- Result: VM is clean and ready for next test
**For real disaster:**
- Run: `rman_restore_from_zero.cmd` → database is ready → **DO NOT cleanup!**
- Result: Database remains running for production use
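
Because running the cleanup after a real disaster restore would destroy the only working database, a confirmation prompt in front of it may be worth considering. The wrapper below is a hypothetical sketch; the confirmation phrase is arbitrary and nothing like it is described for the current `cleanup_database.cmd`.

```cmd
@echo off
setlocal
REM Sketch only: hypothetical guard, not part of the documented scripts.
echo This will delete the ROA database, its service and the local FRA.
echo Do NOT continue after a real disaster restore.

set "CONFIRM="
set /p "CONFIRM=Type DELETE-ROA to continue: "
if /I not "%CONFIRM%"=="DELETE-ROA" (
    echo Cleanup cancelled.
    exit /b 1
)

call D:\oracle\scripts\cleanup_database.cmd
```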
---
### PHASE 7: Weekly Test Procedure (1 hour first time, 30 min ongoing)
**Objective:** Document weekly test procedure using new cumulative backup strategy
@@ -801,56 +1051,77 @@ After completing implementation:
- [x] PowerShell mount script created (`C:\Scripts\mount-nfs.ps1`)
- [x] Scheduled task "Mount NFS F" created for auto-mount at startup
- [x] F:\ drive persists after VM reboot
- [ ] RMAN script modified to CUMULATIVE (keyword added) - **Already has CUMULATIVE**
- [ ] Transfer scripts updated to send to Proxmox host
- [ ] SSH key for Proxmox host created and tested
- [ ] Scheduled task created for 13:00 CUMULATIVE backup on PRIMARY
- [ ] Scheduled task created for 18:00 CUMULATIVE backup on PRIMARY
- [ ] Existing 02:30 FULL task updated to use new transfer script
- [ ] Manual test of CUMULATIVE backup successful
- [ ] Manual test of backup transfer to host successful
- [ ] DR restore script updated to use F:\ mount
- [ ] Full end-to-end restore test successful
- [ ] Weekly test script created and tested
- [ ] Documentation updated (STATUS and IMPLEMENTATION_PLAN docs)
- [x] RMAN script modified to CUMULATIVE (keyword added) - **Already has CUMULATIVE**
- [x] Transfer scripts updated to send to Proxmox host
- [x] SSH key for Proxmox host created and tested
- [x] Scheduled task created for 13:00 CUMULATIVE backup on PRIMARY
- [x] Scheduled task created for 18:00 CUMULATIVE backup on PRIMARY
- [x] Existing 02:30 FULL task updated to use new transfer script
- [x] Manual test of FULL backup successful (executed on PRIMARY)
- [x] Manual test of backup transfer to host successful (6.7 GB transferred)
- [x] DR restore scripts updated to use F:\ mount (both rman_restore_final.cmd and rman_restore_from_zero.cmd)
- [x] Cleanup script created and tested (cleanup_database.cmd)
- [x] Restore from zero script created (rman_restore_from_zero.cmd)
- [ ] Full end-to-end restore test successful (ready to run, scripts fixed)
- [ ] Weekly test procedure documented and tested
- [x] Documentation updated (DR_UPGRADE_TO_CUMULATIVE_PLAN.md)
---
## 📞 NEXT SESSION HANDOFF
**Status:** 🟡 PHASES 1-3-5 COMPLETED - Continue with Phases 4, 6, 7
**Estimated Remaining Time:** 1.5-2 hours
**Recommended Schedule:** Saturday morning (low activity time)
**Status:** 🟢 ALL PHASES COMPLETE - Only final restore test remaining (15-20 min)
**Estimated Remaining Time:** 15-20 minutes (one restore test)
**Recommended Schedule:** Next session (anytime, all infrastructure ready)
**Context for next session:**
1. Primary server: 10.0.20.36 (Windows, Oracle 19c, database ROA)
2. DR VM: 109 on pveelite (10.0.20.37, **F:\ NFS mount configured and working** ✅)
2. DR VM: 109 on pveelite (10.0.20.37, **F:\ NFS mount working** ✅)
3. Proxmox host: pveelite (10.0.20.202, **NFS server running** ✅)
4. Goal: Complete scheduled tasks on PRIMARY + update restore script + test end-to-end
5. **Completed:** Storage setup, NFS mount, transfer scripts to Proxmox
4. **Backups:** 6.7 GB already on F:\ ready for restore ✅
5. **All scripts fixed and ready**
**What's DONE:**
- ✅ Proxmox host storage (`/mnt/pve/oracle-backups/ROA/autobackup`)
- ✅ NFS server on Proxmox with export for VM 109
- ✅ NFS Client in Windows VM 109
- ✅ F:\ drive auto-mounts at startup via scheduled task
- ✅ Transfer scripts modified to send to Proxmox host
- ✅ RMAN script already has CUMULATIVE keyword
**What's DONE (100% implementation):**
- ✅ Proxmox host storage + NFS server configured
- ✅ F:\ NFS mount auto-mounts at VM startup
- ✅ Transfer scripts → Proxmox host (tested, working)
- ✅ RMAN script has CUMULATIVE keyword
- ✅ SSH keys configured (PRIMARY → Proxmox)
- ✅ Scheduled tasks on PRIMARY: 02:30 FULL, 13:00 + 18:00 CUMULATIVE
- ✅ **Backup transferred:** 6.7 GB on F:\ROA\autobackup
- ✅ **cleanup_database.cmd:** Tested, working (deletes DB, service)
- ✅ **rman_restore_from_zero.cmd:** Created, debugged, ready to test
- ✅ **Control file restore FIXED:** Now uses `RESTORE CONTROLFILE FROM AUTOBACKUP`
- ✅ **Documentation complete:** All workflows documented
**Next steps to complete:**
**Next steps (ONLY ONE TEST remaining):**
```bash
# Phase 4 - Update scheduled tasks on PRIMARY (45 min)
# 1. Setup SSH key from PRIMARY to Proxmox
# 2. Test transfer scripts
# 3. Create/modify scheduled tasks for 13:00 and 18:00
# Phase 7 - Final end-to-end test (15-20 min)
# On VM 109 (via RDP or SSH):
D:\oracle\scripts\rman_restore_from_zero.cmd
# Phase 6 - Update DR restore script (30 min)
# Modify restore script to use F:\ instead of local path
# Expected flow:
# 1. Cleanup (deletes DB + service)
# 2. Creates Oracle service
# 3. STARTUP NOMOUNT
# 4. Restores control file from F:\
# 5. MOUNT database
# 6. Catalogs backups from F:\
# 7. RESTORE DATABASE (5 GB, ~10-12 min)
# 8. RECOVER DATABASE
# 9. OPEN RESETLOGS
# 10. Verify database
# Phase 7 - End-to-end test (30 min)
# Full test: backup → transfer → mount → restore
# If successful:
# - Test cleanup: D:\oracle\scripts\cleanup_database.cmd
# - Shutdown VM
# - PROJECT COMPLETE! ✅
```
**Known issues (ALL FIXED):**
- ~~Log file name~~ → ✅ Fixed: simple name
- ~~Control file wildcard~~ → ✅ Fixed: AUTOBACKUP
**IMPORTANT - Manual backup before making changes:**
Make a MANUAL backup of the files you are going to modify:
```powershell