13 Commits

Author SHA1 Message Date
Marius
62848e635d Oracle DR: Add naming convention to RMAN backups for smart restore selection
- Add FORMAT to rman_backup.txt: L0_*, ARC_*, SPFILE_*, CF_*
- Add FORMAT to rman_backup_incremental.txt: L1_*, ARC_*, SPFILE_*, CF_*
- Update rman_restore_from_zero.ps1 TestMode to select files by naming convention
- Select only latest L0 backup set + all L1 incrementals/archives (faster DR tests)
- Backward compatible with old autobackup naming (fallback to copy all)
- Fixes missing datafiles issue (previously only copied 8 files, now copies full backup set)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 16:12:41 +03:00
Marius
5af33fc217 Oracle DR: Add SHUTDOWN ABORT before NOMOUNT to clean auto-started instance
Critical fix for service auto-start behavior:

Problem (identified by user):
- When Oracle service starts, it automatically tries to start instance
- Uses configured PFILE which references control files
- After cleanup, control files don't exist
- Instance ends up in partial/error state
- STARTUP NOMOUNT may fail or behave unexpectedly

Root Cause:
- Oracle service on Windows has auto-start behavior
- Service startup takes ~30s trying to start instance
- Without valid control files, instance is partially started
- This interferes with manual STARTUP NOMOUNT

Solution:
Before STARTUP NOMOUNT, explicitly clean any existing instance:
```sql
SHUTDOWN ABORT;  -- Clean any partial instance
STARTUP NOMOUNT PFILE='...';  -- Fresh clean start
```

Implementation:
- Use WHENEVER SQLERROR CONTINUE (SHUTDOWN may error if no instance)
- Explicit SHUTDOWN ABORT before NOMOUNT
- Ensures clean instance state for RMAN restore
- Service running + clean NOMOUNT instance = ready for restore

User requirement met: Instance in NOMOUNT state (not mounted/open)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:18:45 +03:00
Marius
4b7bb29b9e Oracle DR: Fix service start with polling and timeout (not blocking)
Critical fix - service MUST be running for SQL*Plus connection:

Problem (confirmed by user):
- After cleanup, service is stopped
- sqlplus / as sysdba → ORA-12560: TNS:protocol adapter error
- Start-Service blocks indefinitely (user saw 25+ warnings)
- Service takes ~30 seconds to start

Previous attempt (WRONG):
- Assumed SQL*Plus works with stopped service ✗
- User proved ORA-12560 occurs when service stopped ✓

Correct Solution:
- Start service in background job (non-blocking)
- Poll service status every 3 seconds
- Timeout after 60 seconds (2x expected startup time)
- Progress logging every 15 seconds
- Cleanup background job when done

Implementation:
```powershell
Start-Job { Start-Service OracleServiceROA }
while (elapsed < 60s) {
    if (service.Status == Running) → break
    sleep 3s
}
```

Result:
- Service starts in ~30s (user confirmed)
- Script doesn't block
- SQL*Plus can connect successfully
- Graceful fallback if timeout exceeded

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:14:24 +03:00
Marius
e4df4c11d8 Oracle DR: Fix service start hang - don't start stopped service
Critical fix for service preservation:

Problem:
- After cleanup, Oracle service is stopped
- Start-Service attempts to start the instance automatically
- Without database files, service startup hangs indefinitely
- PowerShell Start-Service blocks waiting for service to start

Root Cause:
- Oracle service on Windows tries to auto-start the instance
- With no controlfile/database files, it cannot start
- Start-Service waits forever (user reported 25+ warnings)

Solution:
- Do NOT attempt to start the stopped service
- SQL*Plus can connect '/ as sysdba' even if service is stopped
- STARTUP NOMOUNT will manually start the instance
- This is the correct Oracle workflow for restore from zero

Windows SQL*Plus requirements:
✓ ORACLE_SID set (we set this)
✓ Service exists in registry (preserved after cleanup)
✓ ORACLE_HOME set (we set this)
✗ Service running NOT required for NOMOUNT startup

The service will naturally transition to Running state when
STARTUP NOMOUNT successfully starts the instance.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:12:21 +03:00
Marius
4256d5a914 Oracle DR: Optimize backup copy - TestMode only copies latest backup set
Major performance optimization for weekly DR tests:

TestMode (weekly testing):
- Copy ONLY latest full backup + everything after it
- Includes: latest DAILY_FULL + incrementals + controlfiles + SPFILE
- Excludes: older full backups (not needed for testing)
- Benefit: ~60-70% reduction (14GB → 4-5GB)
- Copy time: 2min → 30-45sec (saves ~1-1.5 min)
- Risk: Low - testing only needs to verify latest backup works

Standalone Mode (real DR):
- Copy ALL backups (unchanged behavior)
- Includes: all full backups + redundancy for fallback
- Benefit: Maximum safety for disaster recovery
- If latest backup corrupted → RMAN uses previous backup

Implementation:
- Finds latest *DAILY_FULL*.BKP (Level 0 backup)
- Gets its timestamp
- Copies all *.BKP files >= that timestamp
- Automatic inclusion of incrementals, controlfiles, SPFILE backups

Combined optimization results:
- VM polling: saves 60-120s
- Service preservation: saves 40s
- Backup copy (TestMode): saves 60-90s
Total: 160-250 seconds (2.5-4 minutes) per test

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:58:12 +03:00
Marius
c4504cac70 Oracle DR: Increase recovery area size to 50G
Adjust db_recovery_file_dest_size in auto-generated PFILE:
- Previous: 20G
- New: 50G
- Reason: Provide more space for RMAN restore operations and backups

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:06:33 +03:00
Marius
eade344f28 Oracle DR: Auto-create PFILE if missing using tested configuration
Enhancement to rman_restore_from_zero.ps1:
- Auto-generate initROA.ora if not found at service creation
- Uses exact tested configuration from initROA.ora:
  - memory_target=1024M (tested DR VM allocation)
  - _allow_resetlogs_corruption=TRUE (critical for DR restore!)
  - control_files in oradata + recovery_area
  - Standard Oracle 19c parameters for DR environment

Benefits:
- Script is now fully self-sufficient
- No manual PFILE setup required
- DR VM can be restored from completely clean state
- Uses battle-tested configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:05:17 +03:00
Marius
5bed910b1c Oracle DR: Optimize test speed by preserving service between tests
Performance improvements:
- cleanup_database.ps1: Skip service deletion (saves ~25s per test)
  - Remove oradim -delete, sc.exe delete, registry cleanup
  - Add SPFILE deletion to ensure PFILE-based startup
  - Service now persists between tests for reuse

- rman_restore_from_zero.ps1: Smart service check (saves ~15s per test)
  - Check if service exists before creating
  - Skip oradim -new if service already present
  - Only create service on first run or if missing

Total time savings: ~40 seconds per weekly DR test
Service lifecycle: Created once, reused indefinitely until manual cleanup

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:59:43 +03:00
Marius
3a51880c9e Oracle DR: Fix RMAN crosscheck sequence and improve error handling
- Fix CROSSCHECK BACKUP command to execute after database is mounted
- Correct CATALOG command to use recovery_area instead of F:\ path
- Add robust backup file validation with detailed error reporting
- Improve file-by-file backup copying with individual error tracking
- Enhance restore log collection for both success and failure scenarios
- Fix database verification to check OPEN_MODE instead of STATUS
- Add comprehensive directory and permissions error handling

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:32:49 +03:00
Marius
9ed0ee9e0e Oracle DR: Add TestMode parameter for dual behavior
rman_restore_from_zero.ps1:
- Add -TestMode switch parameter
- TestMode (weekly DR test): Skip service/listener config, only verify restore works
- Standalone mode: Full config with SPFILE + Listener for production use

weekly-dr-test-proxmox.sh:
- Call restore script with -TestMode flag
- Avoids service recreation and SSH disconnect during tests

Benefits:
- Weekly tests are faster and cleaner (no service restart)
- Manual restore prepares system for production use
- No more 'Broken pipe' errors during tests

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:33:43 +03:00
Marius
f79331f7cc Oracle DR: Fix service recreation causing SSH disconnect
Remove service delete/recreate at step 3.3 that was causing 'Broken pipe' error
Service is already configured with auto-start at step 2.1 - no need to recreate

Issue: oradim -delete was killing running database and breaking SSH connection
Solution: Skip recreation, service already has correct auto-start configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:29:38 +03:00
Marius
6accd1f996 Oracle DR: Fix verification commands and auto-start services
weekly-dr-test-proxmox.sh:
- Replace Unix commands (echo, grep) with PowerShell equivalents
- Use PowerShell Select-String for database status verification
- Fix table count query to work properly through SSH

rman_restore_from_zero.ps1:
- Set Oracle service to AUTOMATIC startup (was manual)
- Set Listener service to AUTOMATIC startup
- Auto-start Listener after database restore
- Add fallback to lsnrctl if service start fails

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:03:57 +03:00
Marius
2f8c927bbe Oracle DR: Convert restore scripts to PowerShell for SSH compatibility
- Add cleanup_database.ps1: PowerShell version without input redirection issues
- Add rman_restore_from_zero.ps1: PowerShell version with inline SQL commands
- Update weekly-dr-test-proxmox.sh: Call .ps1 scripts via PowerShell

PowerShell scripts resolve SSH 'Input redirection not supported' errors
All SQL commands are piped directly to sqlplus (no temp files needed)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 01:26:35 +03:00