- Replace complex SSH+PowerShell query with simple log file parsing
- rman_restore_from_zero.ps1 already verifies and outputs database status
- Parse 'OPEN_MODE: READ WRITE' and 'TABLES: <count>' from LOG_FILE
- Fixes issue where successful restore was reported as FAILED
- More reliable: avoids the SSH escaping issues that made the Select-String -Quiet approach fragile
Root cause: SSH+PowerShell+sqlplus+Select-String chain was too fragile and
returned empty/false even when database was successfully opened (42625 tables).
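A minimal sketch of the parsing idea, shown in PowerShell (the actual check may live in the bash wrapper via grep; the log path is illustrative):
```powershell
$LogFile = 'C:\scripts\logs\rman_restore.log'   # assumed location of the restore script's log
# pull the two markers rman_restore_from_zero.ps1 writes into its log
$m = Select-String -Path $LogFile -Pattern 'OPEN_MODE:\s*(.+)' | Select-Object -First 1
$t = Select-String -Path $LogFile -Pattern 'TABLES:\s*(\d+)'   | Select-Object -First 1
$openMode = if ($m) { $m.Matches[0].Groups[1].Value.Trim() } else { '' }
$tables   = if ($t) { $t.Matches[0].Groups[1].Value } else { '0' }
if ($openMode -eq 'READ WRITE') { "Restore verified: database open, $tables tables" }
```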
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add full path to FORMAT in rman_backup.txt and rman_backup_incremental.txt
- Files now stored in C:\Users\oracle\recovery_area\ROA\autobackup
- Fixes issue where backups were created in ORACLE_HOME\DATABASE instead of recovery area
- Ensures transfer_to_dr.ps1 can find and transfer all backups correctly
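Illustrative sketch of the kind of FORMAT clause this adds (only the full recovery-area path comes from this change; the backup wording and the %d_%T_%U piece-name mask are assumptions):
```powershell
# feed an RMAN command with the full-path FORMAT to rman via stdin
$rmanCmd = @'
BACKUP INCREMENTAL LEVEL 0 DATABASE
  FORMAT 'C:\Users\oracle\recovery_area\ROA\autobackup\L0_%d_%T_%U';
'@
$rmanCmd | rman target /
```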
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add FORMAT to rman_backup.txt: L0_*, ARC_*, SPFILE_*, CF_*
- Add FORMAT to rman_backup_incremental.txt: L1_*, ARC_*, SPFILE_*, CF_*
- Update rman_restore_from_zero.ps1 TestMode to select files by naming convention
- Select only latest L0 backup set + all L1 incrementals/archives (faster DR tests)
- Backward compatible with old autobackup naming (fallback to copy all)
- Fixes missing datafiles issue (previously only copied 8 files, now copies full backup set)
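A minimal sketch of the selection logic, assuming pieces follow the L0_/L1_/ARC_/SPFILE_/CF_ naming; the source directory and the "latest L0 set = newest L0 file's date" grouping heuristic are illustrative:
```powershell
$backupDir = 'C:\Users\oracle\recovery_area\ROA\autobackup'   # assumed source directory
$all = Get-ChildItem -Path $backupDir -File
$newestL0 = $all | Where-Object Name -like 'L0_*' |
            Sort-Object LastWriteTime -Descending | Select-Object -First 1
$selected = $all | Where-Object {
    ($_.Name -like 'L0_*' -and $_.LastWriteTime.Date -eq $newestL0.LastWriteTime.Date) -or
    $_.Name -like 'L1_*' -or $_.Name -like 'ARC_*' -or
    $_.Name -like 'SPFILE_*' -or $_.Name -like 'CF_*'
}
if (-not $selected) { $selected = $all }   # fallback: old autobackup naming -> copy everything
```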
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix based on user analysis:
PROBLEM:
Cleanup is called in 2 contexts with different requirements:
1. BEFORE restore (from rman_restore): Should NOT shutdown
2. AFTER restore (from weekly-test): MUST shutdown to delete files
USER INSIGHT:
"Why shutdown if restore will clean anyway? But AFTER restore,
you MUST shutdown to release file locks for deletion!"
SOLUTION:
Add /AFTER parameter to cleanup_database.ps1:
WITHOUT /AFTER (before restore):
- Skip SHUTDOWN ABORT
- Skip Stop-Service
- Leave service in current state (running/stopped)
- Files CAN be deleted (no lock before restore)
- Optimization: If service running → restore saves ~30s
WITH /AFTER (after restore):
- SHUTDOWN ABORT (stop instance)
- Stop-Service (release file locks)
- REQUIRED for file deletion after restore
- Files are locked by active instance/service
CALL SITES:
1. rman_restore: cleanup_database.ps1 /SILENT (no /AFTER)
2. weekly-test: cleanup_database.ps1 /SILENT /AFTER (with /AFTER)
FLOW OPTIMIZATION:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service stopped → start(30s) → restore → cleanup /AFTER
→ No improvement yet
BUT if we keep service running between tests:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service running → restore(0s saved!) → cleanup /AFTER
→ Save 30s on subsequent tests!
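A minimal sketch of the /AFTER branch, assuming cleanup_database.ps1 reads its slash-style switches from $args (service name and data path are illustrative):
```powershell
$After = $args -contains '/AFTER'
if ($After) {
    # post-restore: the running instance/service holds locks on the files to delete
    'SHUTDOWN ABORT' | sqlplus '/ as sysdba'
    Stop-Service -Name OracleServiceROA -Force
}
# before a restore (no /AFTER) the files are not locked, so deletion works
# without touching the service state
Remove-Item -Path 'C:\Users\oracle\oradata\ROA\*' -Recurse -Force -ErrorAction SilentlyContinue
```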
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix for service auto-start behavior:
Problem (identified by user):
- When Oracle service starts, it automatically tries to start instance
- Uses configured PFILE which references control files
- After cleanup, control files don't exist
- Instance ends up in partial/error state
- STARTUP NOMOUNT may fail or behave unexpectedly
Root Cause:
- Oracle service on Windows has auto-start behavior
- Service startup takes ~30s trying to start instance
- Without valid control files, instance is partially started
- This interferes with manual STARTUP NOMOUNT
Solution:
Before STARTUP NOMOUNT, explicitly clean any existing instance:
```sql
SHUTDOWN ABORT; -- Clean any partial instance
STARTUP NOMOUNT PFILE='...'; -- Fresh clean start
```
Implementation:
- Use WHENEVER SQLERROR CONTINUE (SHUTDOWN may error if no instance)
- Explicit SHUTDOWN ABORT before NOMOUNT
- Ensures clean instance state for RMAN restore
- Service running + clean NOMOUNT instance = ready for restore
User requirement met: Instance in NOMOUNT state (not mounted/open)
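A minimal sketch of how the script can feed this sequence to SQL*Plus (PFILE path is illustrative):
```powershell
$sql = @'
WHENEVER SQLERROR CONTINUE
SHUTDOWN ABORT
STARTUP NOMOUNT PFILE='C:\Users\oracle\initROA.ora'
EXIT
'@
$sql | sqlplus '/ as sysdba'
```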
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix - service MUST be running for SQL*Plus connection:
Problem (confirmed by user):
- After cleanup, service is stopped
- sqlplus / as sysdba → ORA-12560: TNS:protocol adapter error
- Start-Service blocks indefinitely (user saw 25+ warnings)
- Service takes ~30 seconds to start
Previous attempt (WRONG):
- Assumed SQL*Plus works with stopped service ✗
- User proved ORA-12560 occurs when service stopped ✓
Correct Solution:
- Start service in background job (non-blocking)
- Poll service status every 3 seconds
- Timeout after 60 seconds (2x expected startup time)
- Progress logging every 15 seconds
- Cleanup background job when done
Implementation:
```powershell
# start the service in a background job so the script itself doesn't block
$job = Start-Job { Start-Service -Name OracleServiceROA }
$elapsed = 0
while ($elapsed -lt 60) {
    if ((Get-Service -Name OracleServiceROA).Status -eq 'Running') { break }
    Start-Sleep -Seconds 3
    $elapsed += 3
}
Remove-Job -Job $job -Force   # clean up the background job when done
```
Result:
- Service starts in ~30s (user confirmed)
- Script doesn't block
- SQL*Plus can connect successfully
- Graceful fallback if timeout exceeded
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fix for service preservation:
Problem:
- After cleanup, Oracle service is stopped
- Start-Service attempts to start the instance automatically
- Without database files, service startup hangs indefinitely
- PowerShell Start-Service blocks waiting for service to start
Root Cause:
- Oracle service on Windows tries to auto-start the instance
- With no controlfile/database files, it cannot start
- Start-Service waits forever (user reported 25+ warnings)
Solution:
- Do NOT attempt to start the stopped service
- SQL*Plus can connect '/ as sysdba' even if service is stopped
- STARTUP NOMOUNT will manually start the instance
- This is the correct Oracle workflow for restore from zero
Windows SQL*Plus requirements:
✓ ORACLE_SID set (we set this)
✓ Service exists in registry (preserved after cleanup)
✓ ORACLE_HOME set (we set this)
✗ Service running NOT required for NOMOUNT startup
The service will naturally transition to Running state when
STARTUP NOMOUNT successfully starts the instance.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Performance optimization for VM startup:
Before: Fixed 180s wait regardless of actual boot time
After: Intelligent polling with early exit when VM is ready
Implementation:
- Poll every 5 seconds (max 180s timeout)
- Check 1: VM running status in Proxmox (qm status)
- Check 2: SSH connectivity test
- Check 3: PowerShell availability (what we actually need)
- Exit immediately when all checks pass
- Progress logging every 30 seconds
- Fallback: Continue after 180s with warning
Benefits:
- Fast VM boot (30s) → saves 150s (2min 30s)
- Normal VM boot (60s) → saves 120s (2min)
- Slow VM boot → 180s (same as before)
- More robust: verifies SSH+PowerShell actually work
Average expected improvement: 60-120 seconds per test
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fixes and improvements:
1. Database verification fix (robust):
- Use Select-String -Quiet to get True/False boolean
- Convert PowerShell boolean to bash-friendly format
- Check for 'READ WRITE' in entire sqlplus output
- Eliminates false negatives from text parsing issues
2. Collect FULL RMAN restore log:
- Removed -Head 200 limitation
- Now sends complete RMAN log in email
- Better debugging with full context
- Updated templates: "first 200 lines" → "complete"
3. Add bash script log to email notifications:
- Include last 100 lines of bash execution log
- Separate "RMAN Restore Log" and "Bash Script Log" sections
- Both text and HTML templates updated
- Shows script flow and any bash-level errors
This fixes the issue where 42,625 tables were restored successfully
but test reported FAILED due to query output format mismatch.
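A minimal sketch of the verification idea (connect string and query text are illustrative):
```powershell
$sql = @'
SET HEADING OFF
SELECT OPEN_MODE FROM V$DATABASE;
EXIT
'@
# -Quiet returns a [bool]; emit literal TRUE/FALSE so the calling bash test stays simple
$isOpen = $sql | sqlplus -S '/ as sysdba' | Select-String -Pattern 'READ WRITE' -Quiet
if ($isOpen) { 'TRUE' } else { 'FALSE' }
```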
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Critical fixes for false negatives in DR test reporting:
1. Database verification fix:
- Changed from 'findstr' (CMD) to 'Select-String' (PowerShell native)
- findstr was failing in PowerShell context causing db_status to be empty
- Result: DB with 42,625 tables was incorrectly reported as FAILED
2. Restore log collection fix:
- Changed from 'type' (CMD) to 'Get-Content' (PowerShell native)
- type command doesn't work through SSH PowerShell context
- Added -ErrorAction SilentlyContinue for cleaner error handling
- Simplified fallback logic with a [ -z ] empty-string test instead of string matching
Both issues were caused by mixing CMD commands in PowerShell context.
Now uses PowerShell-native commands throughout for consistency.
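A minimal sketch of the PowerShell-native log collection (log path is illustrative):
```powershell
# Get-Content instead of CMD 'type'; silently yields nothing if the log is missing
$restoreLog = Get-Content -Path 'C:\scripts\logs\rman_restore.log' -ErrorAction SilentlyContinue
if (-not $restoreLog) { $restoreLog = 'RMAN restore log not found' }
$restoreLog
```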
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Adjust db_recovery_file_dest_size in auto-generated PFILE:
- Previous: 20G
- New: 50G
- Reason: Provide more space for RMAN restore operations and backups
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Enhancement to rman_restore_from_zero.ps1:
- Auto-generate initROA.ora if not found at service creation
- Uses exact tested configuration from initROA.ora:
- memory_target=1024M (tested DR VM allocation)
- _allow_resetlogs_corruption=TRUE (critical for DR restore!)
- control_files in oradata + recovery_area
- Standard Oracle 19c parameters for DR environment
Benefits:
- Script is now fully self-sufficient
- No manual PFILE setup required
- DR VM can be restored from completely clean state
- Uses battle-tested configuration
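A minimal sketch of the auto-generation using the parameters listed above (file locations, control file names, and compatible setting are illustrative; db_recovery_file_dest_size shows the 50G value per the adjustment noted earlier):
```powershell
$pfilePath = 'C:\Users\oracle\initROA.ora'   # assumed location checked at service creation
$pfile = @'
*.db_name='ROA'
*.memory_target=1024M
*._allow_resetlogs_corruption=TRUE
*.control_files='C:\Users\oracle\oradata\ROA\CONTROL01.CTL','C:\Users\oracle\recovery_area\ROA\CONTROL02.CTL'
*.db_recovery_file_dest='C:\Users\oracle\recovery_area'
*.db_recovery_file_dest_size=50G
*.compatible='19.0.0'
'@
if (-not (Test-Path $pfilePath)) {
    Set-Content -Path $pfilePath -Value $pfile -Encoding ASCII
}
```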
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Performance improvements:
- cleanup_database.ps1: Skip service deletion (saves ~25s per test)
  - Remove oradim -delete, sc.exe delete, registry cleanup
  - Add SPFILE deletion to ensure PFILE-based startup
  - Service now persists between tests for reuse
- rman_restore_from_zero.ps1: Smart service check (saves ~15s per test)
  - Check if service exists before creating
  - Skip oradim -new if service already present
  - Only create service on first run or if missing
Total time savings: ~40 seconds per weekly DR test
Service lifecycle: Created once, reused indefinitely until manual cleanup
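A minimal sketch of the two sides of this change (SID, PFILE path, and SPFILE location follow Oracle Windows defaults and are illustrative):
```powershell
# restore side: only create the service if it is missing
if (Get-Service -Name OracleServiceROA -ErrorAction SilentlyContinue) {
    Write-Output 'Service already exists - skipping oradim -NEW (~15s saved)'
} else {
    & oradim -NEW -SID ROA -STARTMODE auto -PFILE 'C:\Users\oracle\initROA.ora'
}
# cleanup side: drop the SPFILE so the next startup uses the PFILE, but keep the service
Remove-Item "$env:ORACLE_HOME\database\SPFILEROA.ORA" -Force -ErrorAction SilentlyContinue
```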
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Fix CROSSCHECK BACKUP command to execute after database is mounted
- Correct CATALOG command to use recovery_area instead of F:\ path
- Add robust backup file validation with detailed error reporting
- Improve file-by-file backup copying with individual error tracking
- Enhance restore log collection for both success and failure scenarios
- Fix database verification to check OPEN_MODE instead of STATUS
- Add comprehensive directory and permissions error handling
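Illustrative ordering only: crosscheck and catalog run once the database is mounted, and CATALOG points at the recovery area (the full RMAN run block in the script, including the controlfile restore before the mount, is omitted here):
```powershell
$rman = @'
STARTUP MOUNT;
CROSSCHECK BACKUP;
CATALOG START WITH 'C:\Users\oracle\recovery_area\ROA\autobackup' NOPROMPT;
'@
$rman | rman target /
```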
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
rman_restore_from_zero.ps1:
- Add -TestMode switch parameter
- TestMode (weekly DR test): Skip service/listener config, only verify restore works
- Standalone mode: Full config with SPFILE + Listener for production use
weekly-dr-test-proxmox.sh:
- Call restore script with -TestMode flag
- Avoids service recreation and SSH disconnect during tests
Benefits:
- Weekly tests are faster and cleaner (no service restart)
- Manual restore prepares system for production use
- No more 'Broken pipe' errors during tests
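A minimal sketch of the switch and the branch it controls (the SPFILE statement and listener service name are illustrative):
```powershell
param(
    [switch]$TestMode   # weekly DR test: restore + verify only
)
# ... restore and verification happen here ...
if (-not $TestMode) {
    # standalone/production mode: persist the SPFILE and bring up the listener
    "CREATE SPFILE FROM PFILE='C:\Users\oracle\initROA.ora';" | sqlplus '/ as sysdba'
    Start-Service -Name OracleOraDB19Home1TNSListener
}
```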
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Remove service delete/recreate at step 3.3 that was causing 'Broken pipe' error
Service is already configured with auto-start at step 2.1 - no need to recreate
Issue: oradim -delete was killing running database and breaking SSH connection
Solution: Skip recreation, service already has correct auto-start configuration
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
weekly-dr-test-proxmox.sh:
- Replace Unix commands (echo, grep) with PowerShell equivalents
- Use PowerShell Select-String for database status verification
- Fix table count query to work properly through SSH
rman_restore_from_zero.ps1:
- Set Oracle service to AUTOMATIC startup (was manual)
- Set Listener service to AUTOMATIC startup
- Auto-start Listener after database restore
- Add fallback to lsnrctl if service start fails
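A minimal sketch of the startup-type change and the listener fallback (service names are illustrative; lsnrctl is the standard command-line fallback):
```powershell
Set-Service -Name OracleServiceROA -StartupType Automatic
Set-Service -Name OracleOraDB19Home1TNSListener -StartupType Automatic
try {
    Start-Service -Name OracleOraDB19Home1TNSListener -ErrorAction Stop
} catch {
    & lsnrctl start   # fallback if the Windows service refuses to start
}
```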
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Changes:
- Remove obsolete .cmd scripts (cleanup_database.cmd, rman_restore_from_zero.cmd, rman_restore_final.cmd)
- Update weekly-dr-test-proxmox.sh to call PowerShell scripts with /SILENT parameter
- Add initROA.ora configuration file for reference
All DR test scripts now use PowerShell for SSH compatibility
Resolves input redirection issues with Windows SSH
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add cleanup_database.ps1: PowerShell version without input redirection issues
- Add rman_restore_from_zero.ps1: PowerShell version with inline SQL commands
- Update weekly-dr-test-proxmox.sh: Call .ps1 scripts via PowerShell
PowerShell scripts resolve SSH 'Input redirection not supported' errors
All SQL commands are piped directly to sqlplus (no temp files needed)
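A minimal sketch of the pattern (query text is illustrative): the statements go to sqlplus as stdin, so there is no temp .sql file and no '<' redirection for Windows SSH to reject.
```powershell
$sql = @'
SET HEADING OFF FEEDBACK OFF
SELECT COUNT(*) FROM DBA_TABLES;
EXIT
'@
$sql | sqlplus -S '/ as sysdba'
```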
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Add rman_backup.bat: FULL backup with live console output and log file
- Add rman_backup_incremental.bat: INCREMENTAL backup with live output
- Add rman_backup.txt: RMAN script for LEVEL 0 FULL backup
- Add rman_backup_incremental.txt: RMAN script for LEVEL 1 CUMULATIVE backup
- Scripts are portable: use current directory instead of hardcoded paths
- Logging: simultaneous output to console AND log file using PowerShell Tee-Object
- Log files saved in logs/ subdirectory with timestamps
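A minimal sketch of the logging pattern the .bat files rely on (log name format is illustrative):
```powershell
New-Item -ItemType Directory -Path 'logs' -Force | Out-Null
$log = "logs\rman_backup_$(Get-Date -Format 'yyyyMMdd_HHmmss').log"
# live console output and the log file receive the same stream
& rman target / cmdfile=rman_backup.txt 2>&1 | Tee-Object -FilePath $log
```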
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Convert complex HTML/CSS templates to plain text format for Gmail compatibility
- Replace decorative characters (box drawing, special symbols) with simple text
- Use single-line bullet points instead of complex table layouts
- Improve readability across all email clients (Gmail, Outlook, mobile)
- Remove HTML markup completely, use only plain-text content in both templates
- Keep informative structure with clear section separators
- Both text and HTML templates now identical for consistency
- Critical for Gmail users who only see plain text formatting
New format works perfectly in Gmail:
Oracle Backup WARNING - pveelite
WARNING
========================================
WARNINGS:
- FULL backup is 51 hours old (threshold: 25)
========================================
BACKUP STATUS:
FULL: 51h old TOO OLD (limit: 25h)
CUMULATIVE: 4h old OK (limit: 7h)
Total: 12 files | Size: 6.3GB | Disk: 2%
========================================
Next check: 2025-10-10 + 24h | Proxmox Monitoring
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Fix Proxmox template compatibility: {{hostname}} → {{node}}, {{timestamp}} → {{date}}
- Remove duplicate node fields and fix JSON structure
- Complete full testing plan execution for monitoring and DR test scripts
- Validate notification system functionality with PVE::Notify
- Sync tested scripts from production back to repository
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Remove outdated planning documents and implementation guides
- Update README with comprehensive DR procedures and monitoring
- Enhance rman_restore_from_zero.cmd with SPFILE creation and auto-start
- Add Proxmox monitoring and weekly test scripts
- Archive old implementation documentation
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
COMPLETED:
- Phase 1: Proxmox host storage (/mnt/pve/oracle-backups/ROA/autobackup)
- Phase 2: RMAN script already has CUMULATIVE keyword
- Phase 3: Transfer scripts updated for Proxmox host
* transfer_incremental.ps1: 10.0.20.37:22122 → 10.0.20.202:22
* transfer_to_dr.ps1: Same change
* Converted Windows PowerShell to Linux bash commands
- VM 109 cleanup: ~6.4 GB freed, RMAN catalog cleaned
NEW FILES:
- copy_existing_key_to_proxmox.ps1: Setup script for SSH key
- setup_ssh_keys_for_proxmox.ps1: Alternative setup (not used)
PENDING (Next Session):
- Run copy_existing_key_to_proxmox.ps1 on PRIMARY as Administrator
- Phase 4: Modify scheduled tasks (13:00 + 18:00)
- Phase 5: Configure mount point on VM 109 (F:\ drive)
- Phase 6: Update restore script for F:\ mount
- Phase 7: Test FULL + CUMULATIVE backup and restore
DOCUMENTATION:
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Added implementation status
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Major changes:
- Implemented Windows VM 109 as DR target (replaces Linux LXC)
- Tested RMAN restore successfully (12-15 min RTO, 24h RPO)
- Added comprehensive DR documentation:
* DR_WINDOWS_VM_STATUS_2025-10-09.md - Current implementation status
* DR_UPGRADE_TO_CUMULATIVE_PLAN.md - Plan for cumulative incremental backups
* DR_VM_MIGRATION_GUIDE.md - Guide for VM migration between Proxmox nodes
- Updated DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md with completed phases
New scripts:
- add_system_key_dr.ps1 - SSH key setup for automated transfers
- configure_listener_dr.ps1 - Oracle Listener configuration
- fix_ssh_via_service.ps1 - SSH authentication fix
- rman_restore_final.cmd - Working RMAN restore script (tested)
- transfer_to_dr.ps1 - FULL backup transfer (renamed from 02_*)
- transfer_incremental.ps1 - Incremental backup transfer (renamed from 02b_*)
Cleanup:
- Removed 19 obsolete scripts for Linux LXC DR
- Removed 8 outdated documentation files
- Organized project structure
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The save_config function was using json.dump with default ensure_ascii=True,
which converted all Romanian characters to \uXXXX escape sequences, making
the entire ~/.claude.json file appear modified even when only changing MCP
server configuration for a specific project.
Added ensure_ascii=False to preserve original UTF-8 encoding and minimize
file changes to only the intended MCP server modifications.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add claude-mcp-toggle: CLI tool for managing MCP servers
- Enable/disable individual MCP servers
- Enable/disable all servers
- Set specific servers (disable all, enable selected)
- Interactive mode with menu
- List servers with enabled/disabled status
- Add comprehensive README with usage examples
- Add Oracle DR restore troubleshooting documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The ups-shutdown-cluster.sh script was missing LXC container shutdown
functionality and only shut down VMs. This could leave containers
running during a UPS power failure, causing an ungraceful shutdown.
Changes:
- Added Step 2: LXC container shutdown on all cluster nodes
- Uses 'pct list' to find running containers
- Shuts down each container with 60s timeout
- Parallel shutdown with '&' for speed
- Both local (pvemini) and remote nodes (pve1, pveelite)
- Updated step numbers (now 6 steps total vs 5 before)
- Fixed log_message() to use dynamic timestamp
- Fixed node name comment (pve2 → pveelite)
Shutdown order:
1. VMs on all nodes (timeout 60s)
2. LXC containers on all nodes (timeout 60s) [NEW]
3. Wait 90 seconds for graceful shutdown
4. Secondary nodes shutdown (pve1, pveelite)
5. Wait 30 seconds
6. Primary node shutdown (pvemini)
This matches the behavior in ups-maintenance-shutdown.sh which already
had LXC support.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Adds complete procedure for replacing UPS battery when entire cluster
is powered by the same UPS, requiring full cluster shutdown.
New files:
- scripts/ups-maintenance-shutdown.sh: Automated orchestrated shutdown
for maintenance operations with confirmation prompts and progress display
- docs/UPS-BATTERY-REPLACEMENT.md: Complete step-by-step guide for battery
replacement including pre-shutdown, physical replacement, and post-startup
verification procedures
Features:
- Orchestrated shutdown: VMs → LXC containers → secondary nodes → primary
- Interactive confirmation before shutdown
- Color-coded progress indicators
- Countdown timers for each phase
- Post-replacement verification checklist
- Troubleshooting guide for common issues
- Recovery procedures for cluster/quorum problems
The procedure accounts for all 3 cluster nodes (pve1, pvemini, pveelite)
being on the same UPS, requiring complete infrastructure shutdown.
Documentation includes:
- When to replace battery (based on monthly test results)
- Pre-planning and user notification templates
- Physical battery replacement safety procedures
- Cluster recovery and VM restart procedures
- Post-replacement testing and verification
- 24-hour and 1-week monitoring checklists
Estimated maintenance window: 30-60 minutes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds a comprehensive UPS monitoring and management system for
the Proxmox cluster with automated shutdown orchestration and monthly
battery health testing.
Features:
- NUT (Network UPS Tools) configuration for INNO TECH USB UPS
- Automated cluster shutdown on power failure (3-minute grace period)
- Monthly automated battery testing with health evaluation
- Email notifications via PVE::Notify system
- WinNUT monitoring client for Windows VM 201
Components added:
- config/: NUT configuration files (ups.conf, upsd.conf, upsmon.conf, etc.)
- scripts/ups-shutdown-cluster.sh: Orchestrated cluster shutdown
- scripts/ups-monthly-test.sh: Monthly battery test with email reports
- scripts/upssched-cmd: Event handler for UPS state changes
- docs/: Complete installation and usage documentation
Key findings:
- UPS battery.charge reporting has 10-40 second delay after test start
- Test must monitor voltage drop (1.5-2V) and charge drop (9-27%)
- Battery health evaluation: EXCELLENT/GOOD/FAIR/POOR based on discharge rate
- Email notifications use Handlebars templates without Unicode emojis for compatibility
Configuration:
- UPS: INNO TECH (Voltronic protocol, vendor 0665:5161)
- Primary node: pvemini (10.0.20.201) with USB connection
- Monthly test: cron 0 0 1 * * /opt/scripts/ups-monthly-test.sh
- Shutdown timer: 180 seconds on battery before cluster shutdown
Documentation includes complete installation guides for NUT server,
WinNUT client, and troubleshooting procedures.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Created scrie_jc_2007_oracle10g.sql with FOR LOOP instead of FORALL
- Resolves PLS-00436 error on Oracle 10.2.0.5 and older versions
- Added README_ORACLE10G.md with technical documentation
- Added INSTRUCTIUNI_ORACLE10G.txt with client instructions
- Main version (scrie_jc_2007.sql) remains optimized with FORALL for Oracle 11g+
Performance: ~20-50ms for <10k rows (vs ~15-30ms with FORALL, still ~2400x faster than the old MERGE)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Replace row-by-row processing with FORALL bulk UPDATE/INSERT/DELETE
- Improve readability: l_data(i) → S(i), l_data(l_insert_indices(i)) → SI(i)
- Use dedicated collections: S (source), SI (insert), SD (delete)
- Reduce context switches from 3*N to 3 operations
- Performance improvement: ~15-30ms vs ~80-120ms (3-4x faster for 10k rows)
- Maintain exact same business logic as original implementation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>