69 Commits

Author SHA1 Message Date
Marius
91b9e08e9d general assembly 2025-12-20 22:23:11 +02:00
Marius
90d77704d6 Reorganize Proxmox documentation with clear structure and VM/LXC mapping
## Changes

### Documentation Reorganization
- **README.md**: Complete restructure with logical sections
  - Infrastructure General (proxmox-ssh-guide.md)
  - LXC Containers (oracle-database-lxc108.md)
  - Virtual Machines (vm201-*.md)
  - Cluster-Wide Resources (cluster-ha-monitor.sh, ups/)
  - Archived/Decommissioned (archived-vm107-monitor.sh)
  - Added quick navigation "Am nevoie să..." ("I need to...") section
  - Added recommended workflows
  - Added complete directory structure map

- **proxmox-ssh-guide.md**: Added documentation references section
  - Clear links to all related documentation
  - When to use each document
  - Quick start snippets for each resource

### File Renames for Clarity
- `certificat-letsencrypt-iis.md` → `vm201-certificat-letsencrypt-iis.md`
- `troubleshooting-vm201-backup-nfs.md` → `vm201-troubleshooting-backup-nfs.md`
- `ha-monitor.sh` → `cluster-ha-monitor.sh`
- `vm107-monitor.sh` → `archived-vm107-monitor.sh`

### New Documentation
- **vm201-windows11.md**: Complete VM 201 documentation
  - Hardware configuration
  - Installed services (IIS, SQL*Plus, WinNUT, RDP)
  - Network configuration
  - Backup and recovery procedures
  - Common troubleshooting

## Benefits
- Clear naming convention: VM/LXC/Cluster prefixes
- Central index in README.md with navigation
- Cross-references between documents
- Complete VM 201 documentation suite
- Clear archival of decommissioned resources

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:43:44 +02:00
Marius
cd7b2ed9e7 Clarify storage configuration and fix node names
Storage Configuration improvements:
- Add "Noduri" (Nodes) column showing which nodes have access to each storage
- Clarify that 'local' is separate on each node (non-shared)
- Clarify that 'local-zfs' is shared across pvemini, pve1, pveelite
- Clarify that 'backup' is only on pvemini (10.0.20.201)
- Add detailed explanations for each storage type
- Add storage paths section with important locations

Node name corrections:
- Fix node name: pve2 → pveelite (correct cluster name)
- Update all references across proxmox-ssh-guide.md and README.md
- Add node descriptions in tables for clarity

Benefits:
- Users now know exactly which storage is available on which nodes
- Clear distinction between shared and non-shared storage
- Correct node naming throughout documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:01:08 +02:00
Marius
f1b982794b Reorganize Oracle and Proxmox documentation structure
- Move oracle/CONEXIUNI-ORACLE.md → proxmox/oracle-database-lxc108.md
- Create proxmox/README.md as documentation index
- Update proxmox-ssh-guide.md:
  * Remove VM 107 references (decommissioned)
  * Update LXC and VM tables with IP addresses
  * Add IP address map for all services
  * Simplify Oracle section (detailed info in oracle-database-lxc108.md)
  * Update backup job configuration

Benefits:
- All infrastructure docs in proxmox/ directory
- Clear separation: general Proxmox (proxmox-ssh-guide.md) vs Oracle-specific (oracle-database-lxc108.md)
- No duplicate information between files
- Easy navigation with README.md index

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 12:58:22 +02:00
Marius
b4c2a24281 Fix Oracle DR test ORA-00600 error by forcing service shutdown in cleanup
Problem: DR weekly test failed with ORA-00600 [kcbzib_kcrsds_1] when executed
via cron, but succeeded when run manually. Error occurred during "ALTER DATABASE
OPEN RESETLOGS" step after successful restore and recovery.

Root Cause Analysis:
- Manual test (12:09): Undo initialization = 0ms, no errors
- Cron test (10:45): Undo initialization = 2735ms, ORA-00600 crash
- Alert log showed: "Undo initialization recovery: err:600"
- Oracle instance was in inconsistent state from previous run

The cleanup_database.ps1 script had an "optimization" that preserved the
running Oracle service to "save ~30s startup time". This left the service
in an inconsistent state between test runs, causing Oracle to crash when
attempting to open the database with RESETLOGS.

Solution:
Modified cleanup_database.ps1 to ALWAYS stop Oracle service completely:
1. SHUTDOWN ABORT the instance (not just when /AFTER flag)
2. Stop-Service OracleServiceROA (force clean state)
3. Kill remaining oracle processes
4. Service starts fresh during restore (clean Undo initialization)

Changes:
- Removed if/else branch that skipped shutdown before restore
- Always perform full shutdown regardless of /AFTER parameter
- Updated messages to reflect clean state approach
- Added explanation: "This ensures no state inconsistencies (prevents ORA-00600)"

Testing: Manual test confirmed clean 0ms Undo initialization after fix.

Related: Works in conjunction with weekly-dr-test-proxmox.sh PATH fix (commit 34f91ba)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 12:25:38 +02:00
Marius
34f91ba206 Fix Oracle DR test cron execution by adding explicit PATH
Problem: The weekly DR test script worked when run manually but failed
when executed via cron with "Failed to start VM 109" error at 0 seconds.

Cause: Cron jobs run with a minimal PATH that doesn't include /usr/sbin
where Proxmox commands (qm, pvesh, etc.) are located. Manual execution
had the full PATH including /usr/sbin.

Solution: Added explicit PATH export at the start of the script to ensure
all required system binaries are accessible regardless of execution context.
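
A minimal sketch of this kind of fix, assuming a POSIX shell script. The directory list and the `require_cmd` helper are illustrative assumptions, not taken from the actual script:

```shell
#!/bin/sh
# Cron usually starts jobs with a short PATH, so extend it explicitly.
# The sbin directories below are an assumption (typical Debian layout).
PATH="/usr/local/sbin:/usr/sbin:/sbin:${PATH:-/usr/bin:/bin}"
export PATH

# Hypothetical helper: fail early with a clear message if a tool is missing.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "required command not found: $1" >&2
        return 1
    }
}

# Example on a Proxmox node: require_cmd qm && require_cmd pvesh
```

Checking for the required commands up front turns a confusing failure at 0 seconds into an explicit "command not found" message in the log.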

Testing: Successfully verified with cron test at 11:32 - VM started properly,
restore process completed normally.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 11:37:27 +02:00
Marius
c715a0a89d transfer backups 2025-10-31 01:19:15 +02:00
Marius
63bcdf5c7f backup copy 2025-10-31 01:05:31 +02:00
Marius
13a7cd6d96 IIS SSL certificate 2025-10-26 18:52:44 +02:00
Marius
bc75ce30c2 Add chatbot documentation and Claude agent SDK resources 2025-10-21 16:07:35 +03:00
Marius
132b4fb34b Proxmox HA: Fix false FAILED alerts and suppress cron notification emails
Fixed two critical issues with HA monitoring:
1. False positive quorum errors - corosync-quorumtool not in cron PATH
2. Unwanted cron emails from PVE::Notify INFO messages to STDERR

Changes:
- Set proper PATH including /usr/sbin for corosync-quorumtool
- Split notification code: verbose shows all, non-verbose redirects STDERR to /dev/null
- Prevents cron from sending duplicate notification emails
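
The verbose/non-verbose split can be sketched as a small wrapper. The `run_notify` name and the `VERBOSE` variable are hypothetical; the real script may structure this differently:

```shell
# Hypothetical wrapper: run a notification command and hide its STDERR
# unless VERBOSE=1. Cron mails any output a job produces, so staying
# silent in non-verbose mode avoids the extra email.
run_notify() {
    if [ "${VERBOSE:-0}" = "1" ]; then
        "$@"
    else
        "$@" 2>/dev/null
    fi
}
```

Only STDERR is discarded, so anything the command prints on STDOUT still reaches the caller.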

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 13:35:43 +03:00
Marius
8bb494c60e Oracle DR: Fix backup retention to keep exactly 2 days instead of 3
Changed -mtime logic from +$RetentionDays to +($RetentionDays - 1) to correctly implement 2-day retention. Previously kept 3 days (today + 2 previous), now keeps exactly 2 days (today + yesterday).
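
The off-by-one comes from how `find` evaluates `-mtime +N`: the file age is truncated to whole 24-hour periods and must be strictly greater than N, so `+2` only matches files at least three days old. A sketch assuming GNU find; the function name and the `*.BKP` pattern are illustrative:

```shell
# Hypothetical cleanup helper. With retention_days=2 this runs
# "-mtime +1", which matches files that are at least 48 hours old.
cleanup_old_backups() (
    dir="$1"
    retention_days="$2"
    find "$dir" -maxdepth 1 -type f -name '*.BKP' \
        -mtime "+$((retention_days - 1))" -print -delete
)
```

Calling `cleanup_old_backups /path/to/backups 2` would keep a 36-hour-old file and delete a 3-day-old one.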

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 22:54:02 +03:00
Marius
b50cc2b8c4 Oracle DR: Fix backup retention and monitoring for new naming convention
Problem: Backups accumulated on DR (73 files, 4 days) instead of keeping only 2 days
- transfer_incremental.ps1 had no cleanup function (ran 2x/day without cleanup)
- transfer_to_dr.ps1 cleanup had poor logging
- oracle-backup-monitor-proxmox.sh couldn't detect new L0/L1 backup format

Changes:
- Add cleanup to transfer_incremental.ps1 (delete backups older than 2 days)
- Improve cleanup logging in transfer_to_dr.ps1 (shows count before/after)
- Update oracle-backup-monitor-proxmox.sh to detect both naming conventions:
  * Old: *FULL*.BKP, *INCR*.BKP
  * New: L0_*.BKP (Level 0), L1_*.BKP (Level 1)
- Remove temporary files from /input/ directory

Result: Monitor now correctly reports backup age, cleanup runs after each transfer
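
Detecting both naming conventions can be sketched with a `case` statement over the file name. The `classify_backup` function and its output labels are hypothetical; only the glob patterns come from the list above:

```shell
# Hypothetical classifier for backup file names.
# New convention: L0_*.BKP / L1_*.BKP; old: *FULL*.BKP / *INCR*.BKP.
classify_backup() {
    case "${1##*/}" in          # strip any leading directory
        L0_*.BKP|*FULL*.BKP) echo full ;;
        L1_*.BKP|*INCR*.BKP) echo incremental ;;
        *)                   echo other ;;
    esac
}
```

A monitor could then track the newest file per class and compare its age against the thresholds it reports.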

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-14 18:05:11 +03:00
Marius
249bf4d98a debugging 2025-10-11 19:18:37 +03:00
Marius
1b523c1624 Oracle DR: Add comprehensive restore test debugging guide to README
- Add section 'Debugging Restore Tests' with practical troubleshooting commands
- Check backup files on Proxmox: list, count, verify timestamps
- Verify backup files on DR VM: NFS mount, file counts, sizes
- Check DR test results: parse logs for PASSED/FAILED status
- Simulate test locally: manual restore steps for debugging
- Common issues table with checks and fixes
- Verify naming convention is active (L0_*, L1_* format)
- Manual test run with verbose output for real-time monitoring

Helps diagnose issues like:
- False FAILED notifications
- Missing datafiles
- RMAN-06023 errors
- Backup selection problems

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 19:03:08 +03:00
Marius
8da1208ca7 Oracle DR: Fix false FAILED notification - parse database status from log
- Replace complex SSH+PowerShell query with simple log file parsing
- rman_restore_from_zero.ps1 already verifies and outputs database status
- Parse 'OPEN_MODE: READ WRITE' and 'TABLES: <count>' from LOG_FILE
- Fixes issue where successful restore was reported as FAILED
- More reliable: avoids SSH escaping issues with Select-String -Quiet

Root cause: SSH+PowerShell+sqlplus+Select-String chain was too fragile and
returned empty/false even when database was successfully opened (42625 tables).
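
A sketch of the log-parsing approach, assuming the log contains the two markers exactly as quoted above. The `check_restore_log` name is hypothetical:

```shell
# Hypothetical check: succeed (and print the table count) only if the
# restore log shows an open database and a positive table count.
check_restore_log() (
    log_file="$1"
    grep -q 'OPEN_MODE: READ WRITE' "$log_file" || exit 1
    # Take the last "TABLES: <count>" line; trailing CR from Windows
    # line endings is ignored because only the digits are captured.
    tables=$(sed -n 's/.*TABLES: *\([0-9][0-9]*\).*/\1/p' "$log_file" | tail -n 1)
    [ -n "$tables" ] && [ "$tables" -gt 0 ] || exit 1
    echo "$tables"
)
```

Because the check reads a local file, it avoids the quoting and escaping layers of an SSH + PowerShell + sqlplus pipeline.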

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 18:55:05 +03:00
Marius
a7273d1820 Oracle RMAN: Fix backup location - add full path to FORMAT
- Add full path to FORMAT in rman_backup.txt and rman_backup_incremental.txt
- Files now stored in C:\Users\oracle\recovery_area\ROA\autobackup
- Fixes issue where backups were created in ORACLE_HOME\DATABASE instead of recovery area
- Ensures transfer_to_dr.ps1 can find and transfer all backups correctly

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 16:26:56 +03:00
Marius
62848e635d Oracle DR: Add naming convention to RMAN backups for smart restore selection
- Add FORMAT to rman_backup.txt: L0_*, ARC_*, SPFILE_*, CF_*
- Add FORMAT to rman_backup_incremental.txt: L1_*, ARC_*, SPFILE_*, CF_*
- Update rman_restore_from_zero.ps1 TestMode to select files by naming convention
- Select only latest L0 backup set + all L1 incrementals/archives (faster DR tests)
- Backward compatible with old autobackup naming (fallback to copy all)
- Fixes missing datafiles issue (previously only copied 8 files, now copies full backup set)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 16:12:41 +03:00
Marius
f1002d6e4a Oracle DR: Add /AFTER parameter to cleanup - smart shutdown based on context
Critical fix based on user analysis:

PROBLEM:
Cleanup is called in 2 contexts with different requirements:
1. BEFORE restore (from rman_restore): Should NOT shutdown
2. AFTER restore (from weekly-test): MUST shutdown to delete files

USER INSIGHT:
"Why shutdown if restore will clean anyway? But AFTER restore,
you MUST shutdown to release file locks for deletion!"

SOLUTION:
Add /AFTER parameter to cleanup_database.ps1:

WITHOUT /AFTER (before restore):
- Skip SHUTDOWN ABORT
- Skip Stop-Service
- Leave service in current state (running/stopped)
- Files CAN be deleted (no lock before restore)
- Optimization: If service running → restore saves ~30s

WITH /AFTER (after restore):
- SHUTDOWN ABORT (stop instance)
- Stop-Service (release file locks)
- REQUIRED for file deletion after restore
- Files are locked by active instance/service

CALL SITES:
1. rman_restore: cleanup_database.ps1 /SILENT (no /AFTER)
2. weekly-test: cleanup_database.ps1 /SILENT /AFTER (with /AFTER)

FLOW OPTIMIZATION:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service stopped → start(30s) → restore → cleanup /AFTER
→ No improvement yet

BUT if we keep service running between tests:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service running → restore(0s saved!) → cleanup /AFTER
→ Save 30s on subsequent tests!

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:34:00 +03:00
Marius
5af33fc217 Oracle DR: Add SHUTDOWN ABORT before NOMOUNT to clean auto-started instance
Critical fix for service auto-start behavior:

Problem (identified by user):
- When Oracle service starts, it automatically tries to start instance
- Uses configured PFILE which references control files
- After cleanup, control files don't exist
- Instance ends up in partial/error state
- STARTUP NOMOUNT may fail or behave unexpectedly

Root Cause:
- Oracle service on Windows has auto-start behavior
- Service startup takes ~30s trying to start instance
- Without valid control files, instance is partially started
- This interferes with manual STARTUP NOMOUNT

Solution:
Before STARTUP NOMOUNT, explicitly clean any existing instance:
```sql
SHUTDOWN ABORT;  -- Clean any partial instance
STARTUP NOMOUNT PFILE='...';  -- Fresh clean start
```

Implementation:
- Use WHENEVER SQLERROR CONTINUE (SHUTDOWN may error if no instance)
- Explicit SHUTDOWN ABORT before NOMOUNT
- Ensures clean instance state for RMAN restore
- Service running + clean NOMOUNT instance = ready for restore

User requirement met: Instance in NOMOUNT state (not mounted/open)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:18:45 +03:00
Marius
4b7bb29b9e Oracle DR: Fix service start with polling and timeout (not blocking)
Critical fix - service MUST be running for SQL*Plus connection:

Problem (confirmed by user):
- After cleanup, service is stopped
- sqlplus / as sysdba → ORA-12560: TNS:protocol adapter error
- Start-Service blocks indefinitely (user saw 25+ warnings)
- Service takes ~30 seconds to start

Previous attempt (WRONG):
- Assumed SQL*Plus works with stopped service ✗
- User proved ORA-12560 occurs when service stopped ✓

Correct Solution:
- Start service in background job (non-blocking)
- Poll service status every 3 seconds
- Timeout after 60 seconds (2x expected startup time)
- Progress logging every 15 seconds
- Cleanup background job when done

Implementation:
```powershell
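# Start the service in a background job so the script does not block.
$job = Start-Job { Start-Service -Name OracleServiceROA }
$deadline = (Get-Date).AddSeconds(60)   # 2x the expected ~30s startup
while ((Get-Date) -lt $deadline) {
    if ((Get-Service -Name OracleServiceROA).Status -eq 'Running') { break }
    Start-Sleep -Seconds 3              # poll every 3 seconds
}
Remove-Job -Job $job -Force             # clean up the background job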
```

Result:
- Service starts in ~30s (user confirmed)
- Script doesn't block
- SQL*Plus can connect successfully
- Graceful fallback if timeout exceeded

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:14:24 +03:00
Marius
e4df4c11d8 Oracle DR: Fix service start hang - don't start stopped service
Critical fix for service preservation:

Problem:
- After cleanup, Oracle service is stopped
- Start-Service attempts to start the instance automatically
- Without database files, service startup hangs indefinitely
- PowerShell Start-Service blocks waiting for service to start

Root Cause:
- Oracle service on Windows tries to auto-start the instance
- With no controlfile/database files, it cannot start
- Start-Service waits forever (user reported 25+ warnings)

Solution:
- Do NOT attempt to start the stopped service
- SQL*Plus can connect '/ as sysdba' even if service is stopped
- STARTUP NOMOUNT will manually start the instance
- This is the correct Oracle workflow for restore from zero

Windows SQL*Plus requirements:
✓ ORACLE_SID set (we set this)
✓ Service exists in registry (preserved after cleanup)
✓ ORACLE_HOME set (we set this)
✗ Service running NOT required for NOMOUNT startup

The service will naturally transition to Running state when
STARTUP NOMOUNT successfully starts the instance.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:12:21 +03:00
Marius
4256d5a914 Oracle DR: Optimize backup copy - TestMode only copies latest backup set
Major performance optimization for weekly DR tests:

TestMode (weekly testing):
- Copy ONLY latest full backup + everything after it
- Includes: latest DAILY_FULL + incrementals + controlfiles + SPFILE
- Excludes: older full backups (not needed for testing)
- Benefit: ~60-70% reduction (14GB → 4-5GB)
- Copy time: 2min → 30-45sec (saves ~1-1.5 min)
- Risk: Low - testing only needs to verify latest backup works

Standalone Mode (real DR):
- Copy ALL backups (unchanged behavior)
- Includes: all full backups + redundancy for fallback
- Benefit: Maximum safety for disaster recovery
- If latest backup corrupted → RMAN uses previous backup

Implementation:
- Finds latest *DAILY_FULL*.BKP (Level 0 backup)
- Gets its timestamp
- Copies all *.BKP files >= that timestamp
- Automatic inclusion of incrementals, controlfiles, SPFILE backups
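
The real script is PowerShell; the selection logic above can be sketched in shell as follows. The `select_latest_backup_set` name is hypothetical, and the `-nt`/`-ot` file tests are assumed to be available (bash and dash both provide them):

```shell
# Hypothetical helper: print the newest *DAILY_FULL*.BKP in a directory
# plus every *.BKP that is not older than it.
select_latest_backup_set() (
    dir="$1"
    latest_full=""
    for f in "$dir"/*DAILY_FULL*.BKP; do
        [ -e "$f" ] || continue
        if [ -z "$latest_full" ] || [ "$f" -nt "$latest_full" ]; then
            latest_full="$f"
        fi
    done
    [ -n "$latest_full" ] || exit 1   # no full backup found
    for f in "$dir"/*.BKP; do
        [ -e "$f" ] || continue
        # -ot is true only for strictly older files, so the full backup
        # itself and files with the same timestamp are kept.
        [ "$f" -ot "$latest_full" ] || echo "$f"
    done
)
```

Comparing modification times rather than parsing file names is what lets incrementals, controlfile, and SPFILE backups be included without listing them explicitly.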

Combined optimization results:
- VM polling: saves 60-120s
- Service preservation: saves 40s
- Backup copy (TestMode): saves 60-90s
Total: 160-250 seconds (2.5-4 minutes) per test

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:58:12 +03:00
Marius
5750b42836 Oracle DR: Replace fixed VM boot wait with intelligent polling
Performance optimization for VM startup:

Before: Fixed 180s wait regardless of actual boot time
After: Intelligent polling with early exit when VM is ready

Implementation:
- Poll every 5 seconds (max 180s timeout)
- Check 1: VM running status in Proxmox (qm status)
- Check 2: SSH connectivity test
- Check 3: PowerShell availability (what we actually need)
- Exit immediately when all checks pass
- Progress logging every 30 seconds
- Fallback: Continue after 180s with warning
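
The polling loop can be sketched as a generic helper. The `wait_until` name, the `vm_ready` example, and the SSH target are hypothetical; `qm status` is the Proxmox command named above:

```shell
# Hypothetical helper: run a check command every <interval> seconds
# until it succeeds or roughly <timeout> seconds of waiting have passed
# (time spent inside the check itself is not counted).
wait_until() (
    timeout="$1"
    interval="$2"
    shift 2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        "$@" && exit 0          # ready: leave the loop early
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    exit 1                      # timed out
)

# Example wiring (SSH user and host are assumptions):
# vm_ready() {
#     qm status 109 | grep -q running &&
#     ssh -o ConnectTimeout=5 user@dr-vm 'powershell -Command "exit 0"'
# }
# wait_until 180 5 vm_ready || echo "WARNING: continuing after timeout"
```

Chaining the checks with `&&` means the cheap Proxmox status check runs first and the SSH attempt is only made once the VM reports as running.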

Benefits:
- Fast VM boot (30s) → saves 150s (2min 30s)
- Normal VM boot (60s) → saves 120s (2min)
- Slow VM boot → 180s (same as before)
- More robust: verifies SSH+PowerShell actually work

Average expected improvement: 60-120 seconds per test

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:30:32 +03:00
Marius
835d8b465b Oracle DR: Fix database verification, add bash log, and collect full RMAN log
Critical fixes and improvements:

1. Database verification fix (robust):
   - Use Select-String -Quiet to get True/False boolean
   - Convert PowerShell boolean to bash-friendly format
   - Check for 'READ WRITE' in entire sqlplus output
   - Eliminates false negatives from text parsing issues

2. Collect FULL RMAN restore log:
   - Removed -Head 200 limitation
   - Now sends complete RMAN log in email
   - Better debugging with full context
   - Updated templates: "first 200 lines" → "complete"

3. Add bash script log to email notifications:
   - Include last 100 lines of bash execution log
   - Separate "RMAN Restore Log" and "Bash Script Log" sections
   - Both text and HTML templates updated
   - Shows script flow and any bash-level errors

This fixes the issue where 42,625 tables were restored successfully
but test reported FAILED due to query output format mismatch.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:25:58 +03:00
Marius
12700261c7 Oracle DR: Fix database verification and restore log collection
Critical fixes for false negatives in DR test reporting:

1. Database verification fix:
   - Changed from 'findstr' (CMD) to 'Select-String' (PowerShell native)
   - findstr was failing in PowerShell context causing db_status to be empty
   - Result: DB with 42,625 tables was incorrectly reported as FAILED

2. Restore log collection fix:
   - Changed from 'type' (CMD) to 'Get-Content' (PowerShell native)
   - type command doesn't work through SSH PowerShell context
   - Added -ErrorAction SilentlyContinue for cleaner error handling
   - Simplified fallback logic using a [ -z ... ] test instead of string matching

Both issues were caused by mixing CMD commands in PowerShell context.
Now uses PowerShell-native commands throughout for consistency.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:11:37 +03:00
Marius
c4504cac70 Oracle DR: Increase recovery area size to 50G
Adjust db_recovery_file_dest_size in auto-generated PFILE:
- Previous: 20G
- New: 50G
- Reason: Provide more space for RMAN restore operations and backups

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:06:33 +03:00
Marius
eade344f28 Oracle DR: Auto-create PFILE if missing using tested configuration
Enhancement to rman_restore_from_zero.ps1:
- Auto-generate initROA.ora if not found at service creation
- Uses exact tested configuration from initROA.ora:
  - memory_target=1024M (tested DR VM allocation)
  - _allow_resetlogs_corruption=TRUE (critical for DR restore!)
  - control_files in oradata + recovery_area
  - Standard Oracle 19c parameters for DR environment

Benefits:
- Script is now fully self-sufficient
- No manual PFILE setup required
- DR VM can be restored from completely clean state
- Uses battle-tested configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:05:17 +03:00
Marius
5bed910b1c Oracle DR: Optimize test speed by preserving service between tests
Performance improvements:
- cleanup_database.ps1: Skip service deletion (saves ~25s per test)
  - Remove oradim -delete, sc.exe delete, registry cleanup
  - Add SPFILE deletion to ensure PFILE-based startup
  - Service now persists between tests for reuse

- rman_restore_from_zero.ps1: Smart service check (saves ~15s per test)
  - Check if service exists before creating
  - Skip oradim -new if service already present
  - Only create service on first run or if missing

Total time savings: ~40 seconds per weekly DR test
Service lifecycle: Created once, reused indefinitely until manual cleanup

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:59:43 +03:00
Marius
3a51880c9e Oracle DR: Fix RMAN crosscheck sequence and improve error handling
- Fix CROSSCHECK BACKUP command to execute after database is mounted
- Correct CATALOG command to use recovery_area instead of F:\ path
- Add robust backup file validation with detailed error reporting
- Improve file-by-file backup copying with individual error tracking
- Enhance restore log collection for both success and failure scenarios
- Fix database verification to check OPEN_MODE instead of STATUS
- Add comprehensive directory and permissions error handling

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:32:49 +03:00
Marius
9ed0ee9e0e Oracle DR: Add TestMode parameter for dual behavior
rman_restore_from_zero.ps1:
- Add -TestMode switch parameter
- TestMode (weekly DR test): Skip service/listener config, only verify restore works
- Standalone mode: Full config with SPFILE + Listener for production use

weekly-dr-test-proxmox.sh:
- Call restore script with -TestMode flag
- Avoids service recreation and SSH disconnect during tests

Benefits:
- Weekly tests are faster and cleaner (no service restart)
- Manual restore prepares system for production use
- No more 'Broken pipe' errors during tests

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:33:43 +03:00
Marius
f79331f7cc Oracle DR: Fix service recreation causing SSH disconnect
Remove service delete/recreate at step 3.3 that was causing 'Broken pipe' error
Service is already configured with auto-start at step 2.1 - no need to recreate

Issue: oradim -delete was killing running database and breaking SSH connection
Solution: Skip recreation, service already has correct auto-start configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:29:38 +03:00
Marius
6accd1f996 Oracle DR: Fix verification commands and auto-start services
weekly-dr-test-proxmox.sh:
- Replace Unix commands (echo, grep) with PowerShell equivalents
- Use PowerShell Select-String for database status verification
- Fix table count query to work properly through SSH

rman_restore_from_zero.ps1:
- Set Oracle service to AUTOMATIC startup (was manual)
- Set Listener service to AUTOMATIC startup
- Auto-start Listener after database restore
- Add fallback to lsnrctl if service start fails

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:03:57 +03:00
Marius
026b0436ba Oracle DR: Complete migration to PowerShell scripts and cleanup
Changes:
- Remove obsolete .cmd scripts (cleanup_database.cmd, rman_restore_from_zero.cmd, rman_restore_final.cmd)
- Update weekly-dr-test-proxmox.sh to call PowerShell scripts with /SILENT parameter
- Add initROA.ora configuration file for reference

All DR test scripts now use PowerShell for SSH compatibility
Resolves input redirection issues with Windows SSH

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 01:47:39 +03:00
Marius
2f8c927bbe Oracle DR: Convert restore scripts to PowerShell for SSH compatibility
- Add cleanup_database.ps1: PowerShell version without input redirection issues
- Add rman_restore_from_zero.ps1: PowerShell version with inline SQL commands
- Update weekly-dr-test-proxmox.sh: Call .ps1 scripts via PowerShell

PowerShell scripts resolve SSH 'Input redirection not supported' errors
All SQL commands are piped directly to sqlplus (no temp files needed)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 01:26:35 +03:00
Marius
3c0beda819 Oracle DR: Add RMAN backup scripts with enhanced logging
- Add rman_backup.bat: FULL backup with live console output and log file
- Add rman_backup_incremental.bat: INCREMENTAL backup with live output
- Add rman_backup.txt: RMAN script for LEVEL 0 FULL backup
- Add rman_backup_incremental.txt: RMAN script for LEVEL 1 CUMULATIVE backup
- Scripts are portable: use current directory instead of hardcoded paths
- Logging: simultaneous output to console AND log file using PowerShell Tee-Object
- Log files saved in logs/ subdirectory with timestamps

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-10 23:41:42 +03:00
Marius
839f1b6b82 Oracle DR: Enhance notification templates with compact HTML layouts and improved data collection
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-10 22:41:32 +03:00
Marius
6f56e61b04 Oracle DR: Fix Gmail compatibility with plain text email templates
- Convert complex HTML/CSS templates to plain text format for Gmail compatibility
- Replace decorative characters (box drawing, special symbols) with simple text
- Use single-line bullet points instead of complex table layouts
- Improve readability across all email clients (Gmail, Outlook, mobile)
- Remove HTML templates completely, use only text format
- Keep informative structure with clear section separators
- Both text and HTML templates now identical for consistency
- Critical for Gmail users who only see plain text formatting

New format works perfectly in Gmail:
Oracle Backup WARNING - pveelite
WARNING

========================================
WARNINGS:
- FULL backup is 51 hours old (threshold: 25)

========================================
BACKUP STATUS:
FULL: 51h old TOO OLD (limit: 25h)
CUMULATIVE: 4h old OK (limit: 7h)
Total: 12 files | Size: 6.3GB | Disk: 2%

========================================
Next check: 2025-10-10 + 24h | Proxmox Monitoring

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-10 17:41:33 +03:00
Marius
b34006a499 Oracle DR: Fix template variables and complete monitoring system testing
- Fix Proxmox template compatibility: {{hostname}} → {{node}}, {{timestamp}} → {{date}}
- Remove duplicate node fields and fix JSON structure
- Complete full testing plan execution for monitoring and DR test scripts
- Validate notification system functionality with PVE::Notify
- Sync tested scripts from production back to repository

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-10 17:17:27 +03:00
Marius
b44e3c8f9b Oracle DR: Complete cleanup and restore scripts with Proxmox integration
- Remove outdated planning documents and implementation guides
- Update README with comprehensive DR procedures and monitoring
- Enhance rman_restore_from_zero.cmd with SPFILE creation and auto-start
- Add Proxmox monitoring and weekly test scripts
- Archive old implementation documentation

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-10 15:13:29 +03:00
Marius
cbad9ee779 Oracle DR: Phase 6.5 - Complete cleanup and restore scripts (TESTING)
Major improvements to DR restore workflow:

**New Scripts:**
- cleanup_database.cmd: Complete cleanup using oradim + registry deletion
- rman_restore_from_zero.cmd: Copy backups to recovery_area + restore

**Key Solutions Implemented:**
1. RMAN AUTOBACKUP limitation: Must have backups in recovery_area
   - Solution: Copy ALL backups from F:\ (NFS) to C:\...\recovery_area
   - Performance: 6.7 GB copied in ~2 minutes

2. Oracle service persistence issue: Service remains after sc delete
   - Solution: Use oradim -delete -sid ROA (proper Oracle cleanup)
   - Bonus: Delete registry keys to ensure clean state

**Current Status:**
- Cleanup: TESTED (oradim works perfectly)
- Backup copy: TESTED (6.7 GB in 2 min)
- RMAN restore: 🟡 IN PROGRESS (expected completion 03:35-03:40)

**Updated:**
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Progress tracking + solutions documented
- rman_restore_final.cmd: Use F:\ mount point

🤖 Generated with Claude Code
2025-10-10 03:29:25 +03:00
Marius
8682e0ee04 Oracle DR: Complete Phase 5 - NFS mount point configuration
Phase 5 implementation completed:
- NFS server installed on Proxmox (nfs-kernel-server)
- NFS export configured: /mnt/pve/oracle-backups → VM 109
- Windows NFS Client enabled in VM 109
- F:\ drive auto-mount at startup via scheduled task
- Mount script: D:\Oracle\Scripts\mount-nfs.bat
- Directory permissions set to 777 for Windows compatibility
- Mount persists across VM reboots

Files updated:
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Status → Phases 1-3 and 5 COMPLETED
- Added detailed Phase 5 documentation with step-by-step setup
- Updated validation checklist (8 items completed)

Next: Phases 4, 6, 7 (scheduled tasks, restore script, testing)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-10 01:21:47 +03:00
Marius
ec77bb3ddf WIP: Oracle DR CUMULATIVE backup upgrade - Phases 1-3 completed
COMPLETED:
- Phase 1: Proxmox host storage (/mnt/pve/oracle-backups/ROA/autobackup)
- Phase 2: RMAN script already has CUMULATIVE keyword
- Phase 3: Transfer scripts updated for Proxmox host
  * transfer_incremental.ps1: 10.0.20.37:22122 → 10.0.20.202:22
  * transfer_to_dr.ps1: Same change
  * Converted Windows PowerShell to Linux bash commands
- VM 109 cleanup: ~6.4 GB freed, RMAN catalog cleaned

NEW FILES:
- copy_existing_key_to_proxmox.ps1: Setup script for SSH key
- setup_ssh_keys_for_proxmox.ps1: Alternative setup (not used)

PENDING (Next Session):
- Run copy_existing_key_to_proxmox.ps1 on PRIMARY as Administrator
- Phase 4: Modify scheduled tasks (13:00 + 18:00)
- Phase 5: Configure mount point on VM 109 (F:\ drive)
- Phase 6: Update restore script for F:\ mount
- Phase 7: Test FULL + CUMULATIVE backup and restore

DOCUMENTATION:
- DR_UPGRADE_TO_CUMULATIVE_PLAN.md: Added implementation status

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-09 22:45:32 +03:00
Marius
ac2340c967 Oracle DR: Complete Windows VM implementation and cleanup
Major changes:
- Implemented Windows VM 109 as DR target (replaces Linux LXC)
- Tested RMAN restore successfully (12-15 min RTO, 24h RPO)
- Added comprehensive DR documentation:
  * DR_WINDOWS_VM_STATUS_2025-10-09.md - Current implementation status
  * DR_UPGRADE_TO_CUMULATIVE_PLAN.md - Plan for cumulative incremental backups
  * DR_VM_MIGRATION_GUIDE.md - Guide for VM migration between Proxmox nodes
- Updated DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md with completed phases

New scripts:
- add_system_key_dr.ps1 - SSH key setup for automated transfers
- configure_listener_dr.ps1 - Oracle Listener configuration
- fix_ssh_via_service.ps1 - SSH authentication fix
- rman_restore_final.cmd - Working RMAN restore script (tested)
- transfer_to_dr.ps1 - FULL backup transfer (renamed from 02_*)
- transfer_incremental.ps1 - Incremental backup transfer (renamed from 02b_*)

Cleanup:
- Removed 19 obsolete scripts for Linux LXC DR
- Removed 8 outdated documentation files
- Organized project structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-09 18:54:08 +03:00
Marius
6a6ffe84af standby roa vm windows 2025-10-08 15:33:07 +03:00
Marius
a68f6c381f Fix: Preserve original encoding in claude-mcp-toggle
The save_config function was using json.dump with default ensure_ascii=True,
which converted all Romanian characters to \uXXXX escape sequences, making
the entire ~/.claude.json file appear modified even when only changing MCP
server configuration for a specific project.

Added ensure_ascii=False to preserve original UTF-8 encoding and minimize
file changes to only the intended MCP server modifications.
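
A minimal sketch of the behaviour described above. The config fragment and the `serialize` helper are hypothetical and only illustrate the flag; they are not the actual `save_config` code:

```python
import json

# Hypothetical config fragment; the real ~/.claude.json layout is not shown here.
config = {"note": "Am nevoie să..."}


def serialize(data, ensure_ascii):
    """Return the JSON text that json.dump would write for the given flag."""
    return json.dumps(data, ensure_ascii=ensure_ascii, indent=2)


escaped = serialize(config, ensure_ascii=True)     # default: non-ASCII is escaped
preserved = serialize(config, ensure_ascii=False)  # after the fix: UTF-8 is kept

print(escaped)    # the "ă" appears as a \uXXXX escape sequence
print(preserved)  # the "ă" appears unchanged
```

Both strings decode to the same data, so the change affects only how the file looks in a diff. When writing with `ensure_ascii=False`, the file should be opened with `encoding="utf-8"` so the result does not depend on the platform default.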

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 15:27:16 +03:00
Marius
dfc64ec632 Add Claude Code MCP Server Manager utility and Oracle DR troubleshooting
- Add claude-mcp-toggle: CLI tool for managing MCP servers
  - Enable/disable individual MCP servers
  - Enable/disable all servers
  - Set specific servers (disable all, enable selected)
  - Interactive mode with menu
  - List servers with enabled/disabled status
- Add comprehensive README with usage examples
- Add Oracle DR restore troubleshooting documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 14:58:23 +03:00
Marius
d5bfc6b5c7 Add Oracle DR standby server scripts and Proxmox troubleshooting docs
- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 13:37:33 +03:00
Marius
95f76d7ffb Fix: Add LXC container shutdown to UPS emergency shutdown script
The ups-shutdown-cluster.sh script was missing LXC container shutdown
functionality, only shutting down VMs. This could leave containers
running during UPS power failure, causing ungraceful shutdown.

Changes:
- Added Step 2: LXC container shutdown on all cluster nodes
- Uses 'pct list' to find running containers
- Shuts down each container with 60s timeout
- Parallel shutdown with '&' for speed
- Both local (pvemini) and remote nodes (pve1, pveelite)
- Updated step numbers (now 6 steps total vs 5 before)
- Fixed log_message() to use dynamic timestamp
- Fixed node name comment (pve2 → pveelite)

Shutdown order:
1. VMs on all nodes (timeout 60s)
2. LXC containers on all nodes (timeout 60s) [NEW]
3. Wait 90 seconds for graceful shutdown
4. Secondary nodes shutdown (pve1, pveelite)
5. Wait 30 seconds
6. Primary node shutdown (pvemini)

This matches the behavior in ups-maintenance-shutdown.sh which already
had LXC support.
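
The container step can be sketched as follows. This is an illustration in Python, not the bash code from ups-shutdown-cluster.sh: the function names are hypothetical, the sample `pct list` output (including container 110 and both names) is made up for the example, and the sketch only builds command strings rather than running them in parallel:

```python
def running_container_ids(pct_list_output):
    """Return the IDs of containers whose status column reads 'running'.

    Assumes the usual `pct list` layout: a header row, then one row per
    container with the VMID in the first column and the status in the second.
    """
    ids = []
    for line in pct_list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "running":
            ids.append(fields[0])
    return ids


def container_shutdown_commands(pct_list_output, timeout=60):
    """Build one `pct shutdown` command per running container."""
    return [
        f"pct shutdown {ctid} --timeout {timeout}"
        for ctid in running_container_ids(pct_list_output)
    ]


# Illustrative output only; real IDs and names will differ.
sample = """VMID       Status     Lock         Name
108        running                 oracle-database
110        stopped                 test-container
"""

for command in container_shutdown_commands(sample):
    print(command)
```

Stopped containers are skipped, so only running ones receive a shutdown command with the 60-second timeout named in the commit message.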

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:49:28 +03:00
Marius
cc72a5f96e Add UPS battery replacement procedure and maintenance shutdown script
Adds complete procedure for replacing UPS battery when entire cluster
is powered by the same UPS, requiring full cluster shutdown.

New files:
- scripts/ups-maintenance-shutdown.sh: Automated orchestrated shutdown
  for maintenance operations with confirmation prompts and progress display
- docs/UPS-BATTERY-REPLACEMENT.md: Complete step-by-step guide for battery
  replacement including pre-shutdown, physical replacement, and post-startup
  verification procedures

Features:
- Orchestrated shutdown: VMs → LXC containers → secondary nodes → primary
- Interactive confirmation before shutdown
- Color-coded progress indicators
- Countdown timers for each phase
- Post-replacement verification checklist
- Troubleshooting guide for common issues
- Recovery procedures for cluster/quorum problems

The procedure accounts for all 3 cluster nodes (pve1, pvemini, pveelite)
being on the same UPS, requiring complete infrastructure shutdown.

Documentation includes:
- When to replace battery (based on monthly test results)
- Pre-planning and user notification templates
- Physical battery replacement safety procedures
- Cluster recovery and VM restart procedures
- Post-replacement testing and verification
- 24-hour and 1-week monitoring checklists

Estimated maintenance window: 30-60 minutes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-06 21:46:28 +03:00