- Add full path to FORMAT in rman_backup.txt and rman_backup_incremental.txt
- Files now stored in C:\Users\oracle\recovery_area\ROA\autobackup- Fixes issue where backups were created in ORACLE_HOME\DATABASE instead of recovery area
- Ensures transfer_to_dr.ps1 can find and transfer all backups correctly
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add FORMAT to rman_backup.txt: L0_*, ARC_*, SPFILE_*, CF_*
- Add FORMAT to rman_backup_incremental.txt: L1_*, ARC_*, SPFILE_*, CF_*
- Update rman_restore_from_zero.ps1 TestMode to select files by naming convention
- Select only latest L0 backup set + all L1 incrementals/archives (faster DR tests)
- Backward compatible with old autobackup naming (fallback to copy all)
- Fixes missing datafiles issue (previously only copied 8 files, now copies full backup set)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix based on user analysis:
PROBLEM:
Cleanup is called in 2 contexts with different requirements:
1. BEFORE restore (from rman_restore): Should NOT shutdown
2. AFTER restore (from weekly-test): MUST shutdown to delete files
USER INSIGHT:
"Why shutdown if restore will clean anyway? But AFTER restore,
you MUST shutdown to release file locks for deletion!"
SOLUTION:
Add /AFTER parameter to cleanup_database.ps1:
WITHOUT /AFTER (before restore):
- Skip SHUTDOWN ABORT
- Skip Stop-Service
- Leave service in current state (running/stopped)
- Files CAN be deleted (no lock before restore)
- Optimization: If service running → restore saves ~30s
WITH /AFTER (after restore):
- SHUTDOWN ABORT (stop instance)
- Stop-Service (release file locks)
- REQUIRED for file deletion after restore
- Files are locked by active instance/service
CALL SITES:
1. rman_restore: cleanup_database.ps1 /SILENT (no /AFTER)
2. weekly-test: cleanup_database.ps1 /SILENT /AFTER (with /AFTER)
FLOW OPTIMIZATION:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service stopped → start(30s) → restore → cleanup /AFTER
→ No improvement yet
BUT if we keep service running between tests:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service running → restore(0s saved!) → cleanup /AFTER
→ Save 30s on subsequent tests!
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix for service auto-start behavior:
Problem (identified by user):
- When Oracle service starts, it automatically tries to start instance
- Uses configured PFILE which references control files
- After cleanup, control files don't exist
- Instance ends up in partial/error state
- STARTUP NOMOUNT may fail or behave unexpectedly
Root Cause:
- Oracle service on Windows has auto-start behavior
- Service startup takes ~30s trying to start instance
- Without valid control files, instance is partially started
- This interferes with manual STARTUP NOMOUNT
Solution:
Before STARTUP NOMOUNT, explicitly clean any existing instance:
```sql
SHUTDOWN ABORT; -- Clean any partial instance
STARTUP NOMOUNT PFILE='...'; -- Fresh clean start
```
Implementation:
- Use WHENEVER SQLERROR CONTINUE (SHUTDOWN may error if no instance)
- Explicit SHUTDOWN ABORT before NOMOUNT
- Ensures clean instance state for RMAN restore
- Service running + clean NOMOUNT instance = ready for restore
User requirement met: Instance in NOMOUNT state (not mounted/open)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix - service MUST be running for SQL*Plus connection:
Problem (confirmed by user):
- After cleanup, service is stopped
- sqlplus / as sysdba → ORA-12560: TNS:protocol adapter error
- Start-Service blocks indefinitely (user saw 25+ warnings)
- Service takes ~30 seconds to start
Previous attempt (WRONG):
- Assumed SQL*Plus works with stopped service ✗
- User proved ORA-12560 occurs when service stopped ✓
Correct Solution:
- Start service in background job (non-blocking)
- Poll service status every 3 seconds
- Timeout after 60 seconds (2x expected startup time)
- Progress logging every 15 seconds
- Cleanup background job when done
Implementation:
```powershell
Start-Job { Start-Service OracleServiceROA }
while (elapsed < 60s) {
if (service.Status == Running) → break
sleep 3s
}
```
Result:
- Service starts in ~30s (user confirmed)
- Script doesn't block
- SQL*Plus can connect successfully
- Graceful fallback if timeout exceeded
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix for service preservation:
Problem:
- After cleanup, Oracle service is stopped
- Start-Service attempts to start the instance automatically
- Without database files, service startup hangs indefinitely
- PowerShell Start-Service blocks waiting for service to start
Root Cause:
- Oracle service on Windows tries to auto-start the instance
- With no controlfile/database files, it cannot start
- Start-Service waits forever (user reported 25+ warnings)
Solution:
- Do NOT attempt to start the stopped service
- SQL*Plus can connect '/ as sysdba' even if service is stopped
- STARTUP NOMOUNT will manually start the instance
- This is the correct Oracle workflow for restore from zero
Windows SQL*Plus requirements:
✓ ORACLE_SID set (we set this)
✓ Service exists in registry (preserved after cleanup)
✓ ORACLE_HOME set (we set this)
✗ Service running NOT required for NOMOUNT startup
The service will naturally transition to Running state when
STARTUP NOMOUNT successfully starts the instance.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Performance optimization for VM startup:
Before: Fixed 180s wait regardless of actual boot time
After: Intelligent polling with early exit when VM is ready
Implementation:
- Poll every 5 seconds (max 180s timeout)
- Check 1: VM running status in Proxmox (qm status)
- Check 2: SSH connectivity test
- Check 3: PowerShell availability (what we actually need)
- Exit immediately when all checks pass
- Progress logging every 30 seconds
- Fallback: Continue after 180s with warning
Benefits:
- Fast VM boot (30s) → saves 150s (2min 30s)
- Normal VM boot (60s) → saves 120s (2min)
- Slow VM boot → 180s (same as before)
- More robust: verifies SSH+PowerShell actually work
Average expected improvement: 60-120 seconds per test
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fixes and improvements:
1. Database verification fix (robust):
- Use Select-String -Quiet to get True/False boolean
- Convert PowerShell boolean to bash-friendly format
- Check for 'READ WRITE' in entire sqlplus output
- Eliminates false negatives from text parsing issues
2. Collect FULL RMAN restore log:
- Removed -Head 200 limitation
- Now sends complete RMAN log in email
- Better debugging with full context
- Updated templates: "first 200 lines" → "complete"
3. Add bash script log to email notifications:
- Include last 100 lines of bash execution log
- Separate "RMAN Restore Log" and "Bash Script Log" sections
- Both text and HTML templates updated
- Shows script flow and any bash-level errors
This fixes the issue where 42,625 tables were restored successfully
but test reported FAILED due to query output format mismatch.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fixes for false negatives in DR test reporting:
1. Database verification fix:
- Changed from 'findstr' (CMD) to 'Select-String' (PowerShell native)
- findstr was failing in PowerShell context causing db_status to be empty
- Result: DB with 42,625 tables was incorrectly reported as FAILED
2. Restore log collection fix:
- Changed from 'type' (CMD) to 'Get-Content' (PowerShell native)
- type command doesn't work through SSH PowerShell context
- Added -ErrorAction SilentlyContinue for cleaner error handling
- Simplified fallback logic using [-z] instead of string matching
Both issues were caused by mixing CMD commands in PowerShell context.
Now uses PowerShell-native commands throughout for consistency.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Adjust db_recovery_file_dest_size in auto-generated PFILE:
- Previous: 20G
- New: 50G
- Reason: Provide more space for RMAN restore operations and backups
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Enhancement to rman_restore_from_zero.ps1:
- Auto-generate initROA.ora if not found at service creation
- Uses exact tested configuration from initROA.ora:
- memory_target=1024M (tested DR VM allocation)
- _allow_resetlogs_corruption=TRUE (critical for DR restore!)
- control_files in oradata + recovery_area
- Standard Oracle 19c parameters for DR environment
Benefits:
- Script is now fully self-sufficient
- No manual PFILE setup required
- DR VM can be restored from completely clean state
- Uses battle-tested configuration
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Performance improvements:
- cleanup_database.ps1: Skip service deletion (saves ~25s per test)
- Remove oradim -delete, sc.exe delete, registry cleanup
- Add SPFILE deletion to ensure PFILE-based startup
- Service now persists between tests for reuse
- rman_restore_from_zero.ps1: Smart service check (saves ~15s per test)
- Check if service exists before creating
- Skip oradim -new if service already present
- Only create service on first run or if missing
Total time savings: ~40 seconds per weekly DR test
Service lifecycle: Created once, reused indefinitely until manual cleanup
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Fix CROSSCHECK BACKUP command to execute after database is mounted
- Correct CATALOG command to use recovery_area instead of F:\ path
- Add robust backup file validation with detailed error reporting
- Improve file-by-file backup copying with individual error tracking
- Enhance restore log collection for both success and failure scenarios
- Fix database verification to check OPEN_MODE instead of STATUS
- Add comprehensive directory and permissions error handling
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
rman_restore_from_zero.ps1:
- Add -TestMode switch parameter
- TestMode (weekly DR test): Skip service/listener config, only verify restore works
- Standalone mode: Full config with SPFILE + Listener for production use
weekly-dr-test-proxmox.sh:
- Call restore script with -TestMode flag
- Avoids service recreation and SSH disconnect during tests
Benefits:
- Weekly tests are faster and cleaner (no service restart)
- Manual restore prepares system for production use
- No more 'Broken pipe' errors during tests
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Remove service delete/recreate at step 3.3 that was causing 'Broken pipe' error
Service is already configured with auto-start at step 2.1 - no need to recreate
Issue: oradim -delete was killing running database and breaking SSH connection
Solution: Skip recreation, service already has correct auto-start configuration
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
weekly-dr-test-proxmox.sh:
- Replace Unix commands (echo, grep) with PowerShell equivalents
- Use PowerShell Select-String for database status verification
- Fix table count query to work properly through SSH
rman_restore_from_zero.ps1:
- Set Oracle service to AUTOMATIC startup (was manual)
- Set Listener service to AUTOMATIC startup
- Auto-start Listener after database restore
- Add fallback to lsnrctl if service start fails
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Changes:
- Remove obsolete .cmd scripts (cleanup_database.cmd, rman_restore_from_zero.cmd, rman_restore_final.cmd)
- Update weekly-dr-test-proxmox.sh to call PowerShell scripts with /SILENT parameter
- Add initROA.ora configuration file for reference
All DR test scripts now use PowerShell for SSH compatibility
Resolves input redirection issues with Windows SSH
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add cleanup_database.ps1: PowerShell version without input redirection issues
- Add rman_restore_from_zero.ps1: PowerShell version with inline SQL commands
- Update weekly-dr-test-proxmox.sh: Call .ps1 scripts via PowerShell
PowerShell scripts resolve SSH 'Input redirection not supported' errors
All SQL commands are piped directly to sqlplus (no temp files needed)
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add rman_backup.bat: FULL backup with live console output and log file
- Add rman_backup_incremental.bat: INCREMENTAL backup with live output
- Add rman_backup.txt: RMAN script for LEVEL 0 FULL backup
- Add rman_backup_incremental.txt: RMAN script for LEVEL 1 CUMULATIVE backup
- Scripts are portable: use current directory instead of hardcoded paths
- Logging: simultaneous output to console AND log file using PowerShell Tee-Object
- Log files saved in logs/ subdirectory with timestamps
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Convert complex HTML/CSS templates to plain text format for Gmail compatibility
- Replace decorative characters (box drawing, special symbols) with simple text
- Use single-line bullet points instead of complex table layouts
- Improve readability across all email clients (Gmail, Outlook, mobile)
- Remove HTML templates completely, use only text format
- Keep informative structure with clear section separators
- Both text and HTML templates now identical for consistency
- Critical for Gmail users who only see plain text formatting
New format works perfectly in Gmail:
Oracle Backup WARNING - pveelite
WARNING
========================================
WARNINGS:
- FULL backup is 51 hours old (threshold: 25)
========================================
BACKUP STATUS:
FULL: 51h old TOO OLD (limit: 25h)
CUMULATIVE: 4h old OK (limit: 7h)
Total: 12 files | Size: 6.3GB | Disk: 2%
========================================
Next check: 2025-10-10 + 24h | Proxmox Monitoring
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Fix Proxmox template compatibility: {{hostname}} → {{node}}, {{timestamp}} → {{date}}
- Remove duplicate node fields and fix JSON structure
- Complete full testing plan execution for monitoring and DR test scripts
- Validate notification system functionality with PVE::Notify
- Sync tested scripts from production back to repository
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Remove outdated planning documents and implementation guides
- Update README with comprehensive DR procedures and monitoring
- Enhance rman_restore_from_zero.cmd with SPFILE creation and auto-start
- Add Proxmox monitoring and weekly test scripts
- Archive old implementation documentation
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
COMPLETED:
- Phase 1: Proxmox host storage (/mnt/pve/oracle-backups/ROA/autobackup)
- Phase 2: RMAN script already has CUMULATIVE keyword
- Phase 3: Transfer scripts updated for Proxmox host
* transfer_incremental.ps1: 10.0.20.37:22122 → 10.0.20.202:22
* transfer_to_dr.ps1: Same change
* Converted Windows PowerShell to Linux bash commands
- VM 109 cleanup: ~6.4 GB freed, RMAN catalog cleaned
NEW FILES:
- copy_existing_key_to_proxmox.ps1: Setup script for SSH key
- setup_ssh_keys_for_proxmox.ps1: Alternative setup (not used)
PENDING (Next Session):
- Run copy_existing_key_to_proxmox.ps1 on PRIMARY as Administrator
- Phase 4: Modify scheduled tasks (13:00 + 18:00)
- Phase 5: Configure mount point on VM 109 (F:\ drive)
- Phase 6: Update restore script for F:\ mount
- Phase 7: Test FULL + CUMULATIVE backup and restore
DOCUMENTATION:
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Added implementation status
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Major changes:
- Implemented Windows VM 109 as DR target (replaces Linux LXC)
- Tested RMAN restore successfully (12-15 min RTO, 24h RPO)
- Added comprehensive DR documentation:
* DR_WINDOWS_VM_STATUS_2025-10-09.md - Current implementation status
* DR_UPGRADE_TO_CUMULATIVE_PLAN.md - Plan for cumulative incremental backups
* DR_VM_MIGRATION_GUIDE.md - Guide for VM migration between Proxmox nodes
- Updated DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md with completed phases
New scripts:
- add_system_key_dr.ps1 - SSH key setup for automated transfers
- configure_listener_dr.ps1 - Oracle Listener configuration
- fix_ssh_via_service.ps1 - SSH authentication fix
- rman_restore_final.cmd - Working RMAN restore script (tested)
- transfer_to_dr.ps1 - FULL backup transfer (renamed from 02_*)
- transfer_incremental.ps1 - Incremental backup transfer (renamed from 02b_*)
Cleanup:
- Removed 19 obsolete scripts for Linux LXC DR
- Removed 8 outdated documentation files
- Organized project structure
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add claude-mcp-toggle: CLI tool for managing MCP servers
- Enable/disable individual MCP servers
- Enable/disable all servers
- Set specific servers (disable all, enable selected)
- Interactive mode with menu
- List servers with enabled/disabled status
- Add comprehensive README with usage examples
- Add Oracle DR restore troubleshooting documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>