Commit Graph

86 Commits

Author SHA1 Message Date
Marius
989477f7a4 Add ROA Oracle Database Windows setup scripts with old client support
PowerShell scripts for setting up Oracle 21c/XE with ROA application:
- Automated tablespace, user creation and imports
- sqlnet.ora config for Instant Client 11g/ODBC compatibility
- Oracle 21c read-only Home path handling (homes/OraDB21Home1)
- Listener restart + 10G password verifier for legacy auth
- Tested on VM 302 with CONTAFIN_ORACLE schema import

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 17:08:02 +02:00
Marius
665c2b5d37 Add Oracle 18c sqlnet.ora config for old ODBC/Instant Client 11g compatibility
- Add config/sqlnet.ora with ALLOWED_LOGON_VERSION=8 for old client support
- Add scripts/fix-sqlnet.sh startup script to persist config across container restarts
- Update README with ORA-28040 troubleshooting, ODBC connection params, and deployment instructions
- Fix SID description: Oracle 18c is a CDB with a PDB (XEPDB1), not a non-CDB
- Update container recreation instructions with startup scripts volume

Resolves ORA-28040: No matching authentication protocol when connecting
from Windows ODBC with Oracle Instant Client 11.2

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 00:22:03 +02:00
Marius
fb474c3726 Cleanup 2026-01-27 23:43:20 +02:00
Marius
d2b24c1c47 Update Oracle 18c/21c export scripts and documentation
- Increase LXC 108 memory from 4GB to 8GB + 2GB swap
- Add manual startup/shutdown instructions for Oracle containers
- Document CDB/PDB architecture and correct connection strings
- Fix export-roa2.sh: use XEPDB1 PDB for Oracle 18c, separate DMPDIR
- Fix export-roa2.ps1: dual DMPDIR paths, auto-start containers
- Add container/database status checks before export
- Add TNS entries with SERVICE_NAME=XEPDB1 (not SID=XE)
- Document DBMS_CUBE_EXP warnings as harmless

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 23:42:34 +02:00
Marius
7c6e54f018 Add CLAUDE.md for Claude Code guidance
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 18:07:23 +02:00
Marius
c5936791d0 Rename claude-agent to lxc171-claude-agent with full setup documentation
- Rename proxmox/claude-agent/ to proxmox/lxc171-claude-agent/
- Move scripts to scripts/ subdirectory
- Add complete installation guide for new LXC from scratch
- Update proxmox/README.md with LXC 171 documentation and navigation
- Add LXC 171 to containers table
- Remove .serena/project.yml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 17:54:17 +02:00
Marius
a567f75f25 Reorganize oracle/ and chatbot/ into proxmox/ per LXC/VM structure
- Move oracle/migration-scripts/ to proxmox/lxc108-oracle/migration/
- Move oracle/roa/ and oracle/roa-romconstruct/ to proxmox/lxc108-oracle/sql/
- Move oracle/standby-server-scripts/ to proxmox/vm109-windows-dr/
- Move chatbot/ to proxmox/lxc104-flowise/
- Update proxmox/README.md with new structure and navigation
- Update all documentation with correct directory references
- Remove unused input/claude-agent-sdk/ files

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 17:28:53 +02:00
Marius
4d51d5b2d2 Reorganize proxmox documentation into subdirectories per LXC/VM
- Create cluster/ for Proxmox cluster infrastructure (SSH guide, HA monitor, UPS)
- Create lxc108-oracle/ for Oracle Database documentation and scripts
- Create vm201-windows/ for Windows 11 VM docs and SSL certificate scripts
- Add SSL certificate monitoring scripts (check-ssl-certificates.ps1, monitor-ssl-certificates.sh)
- Remove archived VM107 references (decommissioned)
- Update all cross-references between files
- Update main README.md with new structure and navigation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 17:02:49 +02:00
Marius
1da4c2347c Fix Oracle 10g compatibility in PACK_CONTAFIN SCRIE_JC_2007 procedure
Replace FORALL bulk operations with FOR loops to avoid the PLS-00436 error
on Oracle 10.2.0.5. That release does not support referencing individual
fields of a collection of records in FORALL statements.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-15 13:06:17 +02:00
Marius
a96b9b8d8b romconstruct 2026-01-15 13:00:14 +02:00
Marius
1011d9202c Fix UPS notifications and add periodic battery status emails
- Fix permission denied on log files (chown nut:nut)
- Fix upssched.conf permissions (root:nut)
- Add sudo for perl to allow PVE::Notify from user nut
- Add periodic battery status emails every minute when on battery
- Add charging status emails at 5, 10, 30 min after power restore
- Remove diacritics from all notification messages
- Update documentation with sudo and permissions setup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-14 13:06:25 +02:00
Marius
ab6ac77d50 Add UPS email notifications and automatic UPS shutdown
- Add email notifications via PVE::Notify for all UPS events:
  - ONBATT: when UPS switches to battery
  - ONLINE: when power is restored
  - LOWBATT: critical battery level
  - SHUTDOWN_START/NODE/PRIMARY: during cluster shutdown
  - COMMBAD: communication lost with UPS

- Add automatic UPS shutdown command after cluster shutdown
  (protects against power surge when power returns)

- Update upssched.conf with ONLINE handler and immediate ONBATT notification

- Add notification templates for HTML and text emails

- Update documentation with new features and timer configuration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-13 20:11:30 +02:00
Marius
e0f84298e9 Cleanup 2026-01-11 15:29:12 +02:00
Marius
00c6410dbd Document VM 201 power outage incident and update HA configuration
- Add troubleshooting guide for 2026-01-11 power outage incident
- Update vm201-windows11.md with correct storage details (disk-1, disk-3)
- Remove HA configuration, document manual failover procedure
- Add ZFS replication status and commands
- Document lessons learned: ISO attachments block migration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 15:28:00 +02:00
Marius
594b77e449 Add Claude Agent LXC setup and workflow scripts
- Create LXC 171 (claude-agent) on pveelite with Ubuntu 24.04
- Install Node.js 20.x, Claude Code, tmux, Tailscale
- Configure SSH access and Gitea integration
- Add workflow scripts: start-agent.sh, work.sh, new-task.sh, finish-task.sh
- Add code-server for mobile file browsing
- Document complete setup in proxmox/claude-agent/README.md

LXC Details:
- Internal IP: 10.0.20.171
- Tailscale IP: 100.95.55.51
- code-server: port 8080

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-31 18:53:23 +02:00
Marius
f01341a707 Add Claude Code configuration
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-31 01:03:23 +02:00
Marius
42f5d5ac85 Add chatbot infrastructure documentation
Document Flowise and ngrok configuration on LXC 104, including
troubleshooting steps for CORS and version issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-31 01:02:50 +02:00
Marius
91b9e08e9d General assembly 2025-12-20 22:23:11 +02:00
Marius
90d77704d6 Reorganize Proxmox documentation with clear structure and VM/LXC mapping
## Changes

### Documentation Reorganization
- **README.md**: Complete restructure with logical sections
  - Infrastructure General (proxmox-ssh-guide.md)
  - LXC Containers (oracle-database-lxc108.md)
  - Virtual Machines (vm201-*.md)
  - Cluster-Wide Resources (cluster-ha-monitor.sh, ups/)
  - Archived/Decommissioned (archived-vm107-monitor.sh)
  - Added quick navigation "Am nevoie să..." ("I need to...") section
  - Added recommended workflows
  - Added complete directory structure map

- **proxmox-ssh-guide.md**: Added documentation references section
  - Clear links to all related documentation
  - When to use each document
  - Quick start snippets for each resource

### File Renames for Clarity
- `certificat-letsencrypt-iis.md` → `vm201-certificat-letsencrypt-iis.md`
- `troubleshooting-vm201-backup-nfs.md` → `vm201-troubleshooting-backup-nfs.md`
- `ha-monitor.sh` → `cluster-ha-monitor.sh`
- `vm107-monitor.sh` → `archived-vm107-monitor.sh`

### New Documentation
- **vm201-windows11.md**: Complete VM 201 documentation
  - Hardware configuration
  - Installed services (IIS, SQL*Plus, WinNUT, RDP)
  - Network configuration
  - Backup and recovery procedures
  - Common troubleshooting

## Benefits
- Clear naming convention: VM/LXC/Cluster prefixes
- Central index in README.md with navigation
- Cross-references between documents
- Complete VM 201 documentation suite
- Clear archival of decommissioned resources

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:43:44 +02:00
Marius
cd7b2ed9e7 Clarify storage configuration and fix node names
Storage Configuration improvements:
- Add "Noduri" ("Nodes") column showing which nodes have access to each storage
- Clarify that 'local' is separate on each node (non-shared)
- Clarify that 'local-zfs' is shared across pvemini, pve1, pveelite
- Clarify that 'backup' is only on pvemini (10.0.20.201)
- Add detailed explanations for each storage type
- Add storage paths section with important locations

Node name corrections:
- Fix node name: pve2 → pveelite (correct cluster name)
- Update all references across proxmox-ssh-guide.md and README.md
- Add node descriptions in tables for clarity

Benefits:
- Users now know exactly which storage is available on which nodes
- Clear distinction between shared and non-shared storage
- Correct node naming throughout documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 13:01:08 +02:00
Marius
f1b982794b Reorganize Oracle and Proxmox documentation structure
- Move oracle/CONEXIUNI-ORACLE.md → proxmox/oracle-database-lxc108.md
- Create proxmox/README.md as documentation index
- Update proxmox-ssh-guide.md:
  * Remove VM 107 references (decommissioned)
  * Update LXC and VM tables with IP addresses
  * Add IP address map for all services
  * Simplify Oracle section (detailed info in oracle-database-lxc108.md)
  * Update backup job configuration

Benefits:
- All infrastructure docs in proxmox/ directory
- Clear separation: general Proxmox (proxmox-ssh-guide.md) vs Oracle-specific (oracle-database-lxc108.md)
- No duplicate information between files
- Easy navigation with README.md index

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-19 12:58:22 +02:00
Marius
b4c2a24281 Fix Oracle DR test ORA-00600 error by forcing service shutdown in cleanup
Problem: DR weekly test failed with ORA-00600 [kcbzib_kcrsds_1] when executed
via cron, but succeeded when run manually. Error occurred during "ALTER DATABASE
OPEN RESETLOGS" step after successful restore and recovery.

Root Cause Analysis:
- Manual test (12:09): Undo initialization = 0ms, no errors
- Cron test (10:45): Undo initialization = 2735ms, ORA-00600 crash
- Alert log showed: "Undo initialization recovery: err:600"
- Oracle instance was in inconsistent state from previous run

The cleanup_database.ps1 script had an "optimization" that preserved the
running Oracle service to "save ~30s startup time". This left the service
in an inconsistent state between test runs, causing Oracle to crash when
attempting to open the database with RESETLOGS.

Solution:
Modified cleanup_database.ps1 to ALWAYS stop Oracle service completely:
1. SHUTDOWN ABORT the instance (not just when /AFTER flag)
2. Stop-Service OracleServiceROA (force clean state)
3. Kill remaining oracle processes
4. Service starts fresh during restore (clean Undo initialization)

Changes:
- Removed if/else branch that skipped shutdown before restore
- Always perform full shutdown regardless of /AFTER parameter
- Updated messages to reflect clean state approach
- Added explanation: "This ensures no state inconsistencies (prevents ORA-00600)"

Testing: Manual test confirmed clean 0ms Undo initialization after fix.

Related: Works in conjunction with weekly-dr-test-proxmox.sh PATH fix (commit 34f91ba)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 12:25:38 +02:00
Marius
34f91ba206 Fix Oracle DR test cron execution by adding explicit PATH
Problem: The weekly DR test script worked when run manually but failed
when executed via cron with "Failed to start VM 109" error at 0 seconds.

Cause: Cron jobs run with a minimal PATH that doesn't include /usr/sbin
where Proxmox commands (qm, pvesh, etc.) are located. Manual execution
had the full PATH including /usr/sbin.

Solution: Added explicit PATH export at the start of the script to ensure
all required system binaries are accessible regardless of execution context.
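
A minimal sketch of the explicit PATH export described above. The directory list is an assumption (a typical Debian-style default), since the exact value used in weekly-dr-test-proxmox.sh is not shown in this log, and `require_cmd` is a hypothetical helper added only for illustration.

```shell
# Assumed directory list; adjust to the host. cron usually provides only a
# minimal PATH, so /usr/sbin (where qm and pvesh live) is added explicitly.
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Hypothetical helper: fail early with a clear message if a binary is missing.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing command: $1" >&2
        return 1
    }
}
```

Calling `require_cmd qm` near the top of the script would turn a silent cron failure into an explicit error message.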

Testing: Successfully verified with cron test at 11:32 - VM started properly,
restore process completed normally.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 11:37:27 +02:00
Marius
c715a0a89d transfer backups 2025-10-31 01:19:15 +02:00
Marius
63bcdf5c7f Copy backup 2025-10-31 01:05:31 +02:00
Marius
13a7cd6d96 IIS SSL certificate 2025-10-26 18:52:44 +02:00
Marius
bc75ce30c2 Add chatbot documentation and Claude agent SDK resources 2025-10-21 16:07:35 +03:00
Marius
132b4fb34b Proxmox HA: Fix false FAILED alerts and suppress cron notification emails
Fixed two critical issues with HA monitoring:
1. False positive quorum errors - corosync-quorumtool not in cron PATH
2. Unwanted cron emails from PVE::Notify INFO messages to STDERR

Changes:
- Set proper PATH including /usr/sbin for corosync-quorumtool
- Split notification code: verbose shows all, non-verbose redirects STDERR to /dev/null
- Prevents cron from sending duplicate notification emails
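
The verbose/non-verbose split can be sketched roughly as follows. `run_notify` and the `VERBOSE` variable are hypothetical names, not taken from cluster-ha-monitor.sh; the sketch only assumes that the notification command writes its INFO messages to STDERR, as the commit describes.

```shell
# Hypothetical wrapper: with VERBOSE=1 everything is shown; otherwise STDERR
# is discarded so INFO messages do not reach cron and trigger an email.
run_notify() {
    if [ "${VERBOSE:-0}" = "1" ]; then
        "$@"
    else
        "$@" 2>/dev/null
    fi
}
```

Only STDERR is silenced here; anything the command prints on STDOUT would still be mailed by cron.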

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 13:35:43 +03:00
Marius
8bb494c60e Oracle DR: Fix backup retention to keep exactly 2 days instead of 3
Changed -mtime logic from +$RetentionDays to +($RetentionDays - 1) to correctly implement 2-day retention. Previously kept 3 days (today + 2 previous), now keeps exactly 2 days (today + yesterday).
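
A small sketch of why the off-by-one occurs, assuming GNU `find` and backup files ending in `.BKP`. The function name and arguments are hypothetical. `-mtime +N` matches files whose age, counted in whole 24-hour periods, is strictly greater than N, so `+2` still keeps files that are between two and three days old.

```shell
# Hypothetical helper: list backup files that fall outside the retention window.
# With retention_days=2 this uses "-mtime +1", which matches files that are
# at least two full 24-hour periods old.
list_expired_backups() {
    local dir=$1 retention_days=$2
    find "$dir" -maxdepth 1 -type f -name '*.BKP' \
        -mtime "+$((retention_days - 1))"
}
```

Note that `find` counts 24-hour periods from the current time, not calendar days, so "today + yesterday" is an approximation of what is kept.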

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-15 22:54:02 +03:00
Marius
b50cc2b8c4 Oracle DR: Fix backup retention and monitoring for new naming convention
Problem: Backups accumulated on DR (73 files, 4 days) instead of keeping only 2 days
- transfer_incremental.ps1 had no cleanup function (ran 2x/day without cleanup)
- transfer_to_dr.ps1 cleanup had poor logging
- oracle-backup-monitor-proxmox.sh couldn't detect new L0/L1 backup format

Changes:
- Add cleanup to transfer_incremental.ps1 (delete backups older than 2 days)
- Improve cleanup logging in transfer_to_dr.ps1 (shows count before/after)
- Update oracle-backup-monitor-proxmox.sh to detect both naming conventions:
  * Old: *FULL*.BKP, *INCR*.BKP
  * New: L0_*.BKP (Level 0), L1_*.BKP (Level 1)
- Remove temporary files from /input/ directory
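
The detection of both naming conventions listed above could look roughly like this. `classify_backup` and its output labels are hypothetical; only the four file name patterns come from the commit.

```shell
# Hypothetical classifier for the two naming conventions; prints the backup type.
# The new-format patterns are tested first so a name such as L0_FULL_x.BKP
# is reported as level0.
classify_backup() {
    case "$(basename "$1")" in
        L0_*.BKP)   echo "level0" ;;       # new format, Level 0
        L1_*.BKP)   echo "level1" ;;       # new format, Level 1
        *FULL*.BKP) echo "full-legacy" ;;  # old format
        *INCR*.BKP) echo "incr-legacy" ;;  # old format
        *)          echo "other" ;;
    esac
}
```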

Result: Monitor now correctly reports backup age, cleanup runs after each transfer

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-14 18:05:11 +03:00
Marius
249bf4d98a debugging 2025-10-11 19:18:37 +03:00
Marius
1b523c1624 Oracle DR: Add comprehensive restore test debugging guide to README
- Add section 'Debugging Restore Tests' with practical troubleshooting commands
- Check backup files on Proxmox: list, count, verify timestamps
- Verify backup files on DR VM: NFS mount, file counts, sizes
- Check DR test results: parse logs for PASSED/FAILED status
- Simulate test locally: manual restore steps for debugging
- Common issues table with checks and fixes
- Verify naming convention is active (L0_*, L1_* format)
- Manual test run with verbose output for real-time monitoring

Helps diagnose issues like:
- False FAILED notifications
- Missing datafiles
- RMAN-06023 errors
- Backup selection problems

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 19:03:08 +03:00
Marius
8da1208ca7 Oracle DR: Fix false FAILED notification - parse database status from log
- Replace complex SSH+PowerShell query with simple log file parsing
- rman_restore_from_zero.ps1 already verifies and outputs database status
- Parse 'OPEN_MODE: READ WRITE' and 'TABLES: <count>' from LOG_FILE
- Fixes issue where successful restore was reported as FAILED
- More reliable: avoids SSH escaping issues with Select-String -Quiet

Root cause: SSH+PowerShell+sqlplus+Select-String chain was too fragile and
returned empty/false even when database was successfully opened (42625 tables).
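
A rough sketch of the log-parsing approach, assuming the restore log contains the two markers quoted above on their own lines. `check_restore_log` is a hypothetical name, and the exact layout of the real log is not shown in this commit.

```shell
# Hypothetical check: succeed only if the log reports an open database and a
# table count. Prints the table count on success.
check_restore_log() {
    local log_file=$1 tables
    grep -q 'OPEN_MODE: READ WRITE' "$log_file" || return 1
    # Take the last "TABLES: <count>" line in case the log has several.
    tables=$(sed -n 's/.*TABLES: *\([0-9][0-9]*\).*/\1/p' "$log_file" | tail -n 1)
    [ -n "$tables" ] || return 1
    echo "$tables"
}
```

Because the check reads a local file, it avoids the quoting problems of passing a PowerShell pipeline through SSH.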

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 18:55:05 +03:00
Marius
a7273d1820 Oracle RMAN: Fix backup location - add full path to FORMAT
- Add full path to FORMAT in rman_backup.txt and rman_backup_incremental.txt
- Files now stored in C:\Users\oracle\recovery_area\ROA\autobackup
- Fixes issue where backups were created in ORACLE_HOME\DATABASE instead of recovery area
- Ensures transfer_to_dr.ps1 can find and transfer all backups correctly

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 16:26:56 +03:00
Marius
62848e635d Oracle DR: Add naming convention to RMAN backups for smart restore selection
- Add FORMAT to rman_backup.txt: L0_*, ARC_*, SPFILE_*, CF_*
- Add FORMAT to rman_backup_incremental.txt: L1_*, ARC_*, SPFILE_*, CF_*
- Update rman_restore_from_zero.ps1 TestMode to select files by naming convention
- Select only latest L0 backup set + all L1 incrementals/archives (faster DR tests)
- Backward compatible with old autobackup naming (fallback to copy all)
- Fixes missing datafiles issue (previously only copied 8 files, now copies full backup set)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 16:12:41 +03:00
Marius
f1002d6e4a Oracle DR: Add /AFTER parameter to cleanup - smart shutdown based on context
Critical fix based on user analysis:

PROBLEM:
Cleanup is called in 2 contexts with different requirements:
1. BEFORE restore (from rman_restore): Should NOT shutdown
2. AFTER restore (from weekly-test): MUST shutdown to delete files

USER INSIGHT:
"Why shutdown if restore will clean anyway? But AFTER restore,
you MUST shutdown to release file locks for deletion!"

SOLUTION:
Add /AFTER parameter to cleanup_database.ps1:

WITHOUT /AFTER (before restore):
- Skip SHUTDOWN ABORT
- Skip Stop-Service
- Leave service in current state (running/stopped)
- Files CAN be deleted (no lock before restore)
- Optimization: If service running → restore saves ~30s

WITH /AFTER (after restore):
- SHUTDOWN ABORT (stop instance)
- Stop-Service (release file locks)
- REQUIRED for file deletion after restore
- Files are locked by active instance/service

CALL SITES:
1. rman_restore: cleanup_database.ps1 /SILENT (no /AFTER)
2. weekly-test: cleanup_database.ps1 /SILENT /AFTER (with /AFTER)

FLOW OPTIMIZATION:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service stopped → start(30s) → restore → cleanup /AFTER
→ No improvement yet

BUT if we keep service running between tests:
Test 1: Service stopped → start(30s) → restore → cleanup /AFTER
Test 2: Service running → restore(0s saved!) → cleanup /AFTER
→ Save 30s on subsequent tests!

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:34:00 +03:00
Marius
5af33fc217 Oracle DR: Add SHUTDOWN ABORT before NOMOUNT to clean auto-started instance
Critical fix for service auto-start behavior:

Problem (identified by user):
- When Oracle service starts, it automatically tries to start instance
- Uses configured PFILE which references control files
- After cleanup, control files don't exist
- Instance ends up in partial/error state
- STARTUP NOMOUNT may fail or behave unexpectedly

Root Cause:
- Oracle service on Windows has auto-start behavior
- Service startup takes ~30s trying to start instance
- Without valid control files, instance is partially started
- This interferes with manual STARTUP NOMOUNT

Solution:
Before STARTUP NOMOUNT, explicitly clean any existing instance:
```sql
SHUTDOWN ABORT;  -- Clean any partial instance
STARTUP NOMOUNT PFILE='...';  -- Fresh clean start
```

Implementation:
- Use WHENEVER SQLERROR CONTINUE (SHUTDOWN may error if no instance)
- Explicit SHUTDOWN ABORT before NOMOUNT
- Ensures clean instance state for RMAN restore
- Service running + clean NOMOUNT instance = ready for restore

User requirement met: Instance in NOMOUNT state (not mounted/open)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:18:45 +03:00
Marius
4b7bb29b9e Oracle DR: Fix service start with polling and timeout (not blocking)
Critical fix - service MUST be running for SQL*Plus connection:

Problem (confirmed by user):
- After cleanup, service is stopped
- sqlplus / as sysdba → ORA-12560: TNS:protocol adapter error
- Start-Service blocks indefinitely (user saw 25+ warnings)
- Service takes ~30 seconds to start

Previous attempt (WRONG):
- Assumed SQL*Plus works with stopped service ✗
- User proved ORA-12560 occurs when service stopped ✓

Correct Solution:
- Start service in background job (non-blocking)
- Poll service status every 3 seconds
- Timeout after 60 seconds (2x expected startup time)
- Progress logging every 15 seconds
- Cleanup background job when done

Implementation:
```powershell
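# Start the service in a background job so the script does not block
$job = Start-Job { Start-Service OracleServiceROA }
$deadline = (Get-Date).AddSeconds(60)
while ((Get-Date) -lt $deadline) {
    if ((Get-Service OracleServiceROA).Status -eq 'Running') { break }
    Start-Sleep -Seconds 3
}
Remove-Job $job -Force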
```

Result:
- Service starts in ~30s (user confirmed)
- Script doesn't block
- SQL*Plus can connect successfully
- Graceful fallback if timeout exceeded

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:14:24 +03:00
Marius
e4df4c11d8 Oracle DR: Fix service start hang - don't start stopped service
Critical fix for service preservation:

Problem:
- After cleanup, Oracle service is stopped
- Start-Service attempts to start the instance automatically
- Without database files, service startup hangs indefinitely
- PowerShell Start-Service blocks waiting for service to start

Root Cause:
- Oracle service on Windows tries to auto-start the instance
- With no controlfile/database files, it cannot start
- Start-Service waits forever (user reported 25+ warnings)

Solution:
- Do NOT attempt to start the stopped service
- SQL*Plus can connect '/ as sysdba' even if service is stopped
- STARTUP NOMOUNT will manually start the instance
- This is the correct Oracle workflow for restore from zero

Windows SQL*Plus requirements:
✓ ORACLE_SID set (we set this)
✓ Service exists in registry (preserved after cleanup)
✓ ORACLE_HOME set (we set this)
✗ Service running NOT required for NOMOUNT startup

The service will naturally transition to Running state when
STARTUP NOMOUNT successfully starts the instance.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 15:12:21 +03:00
Marius
4256d5a914 Oracle DR: Optimize backup copy - TestMode only copies latest backup set
Major performance optimization for weekly DR tests:

TestMode (weekly testing):
- Copy ONLY latest full backup + everything after it
- Includes: latest DAILY_FULL + incrementals + controlfiles + SPFILE
- Excludes: older full backups (not needed for testing)
- Benefit: ~60-70% reduction (14GB → 4-5GB)
- Copy time: 2min → 30-45sec (saves ~1-1.5 min)
- Risk: Low - testing only needs to verify latest backup works

Standalone Mode (real DR):
- Copy ALL backups (unchanged behavior)
- Includes: all full backups + redundancy for fallback
- Benefit: Maximum safety for disaster recovery
- If latest backup corrupted → RMAN uses previous backup

Implementation:
- Finds latest *DAILY_FULL*.BKP (Level 0 backup)
- Gets its timestamp
- Copies all *.BKP files >= that timestamp
- Automatic inclusion of incrementals, controlfiles, SPFILE backups
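
The selection steps above can be sketched as a shell analogue. The real script is PowerShell and is not shown here, so the function name is hypothetical. The sketch also assumes file names without newlines and uses modification time as the timestamp.

```shell
# Hypothetical shell analogue of the TestMode selection.
# Prints the newest *DAILY_FULL*.BKP, then every *.BKP modified after it.
# find's -newer is strict, so a file with exactly the same timestamp as the
# full backup is not listed (the commit describes a >= comparison).
select_test_backups() {
    local dir=$1 latest
    latest=$(ls -t "$dir"/*DAILY_FULL*.BKP 2>/dev/null | head -n 1)
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
    find "$dir" -maxdepth 1 -type f -name '*.BKP' -newer "$latest" | sort
}
```

Older full backups and anything written before the latest full backup are left out, which is where the reduced copy size comes from.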

Combined optimization results:
- VM polling: saves 60-120s
- Service preservation: saves 40s
- Backup copy (TestMode): saves 60-90s
Total: 160-250 seconds (roughly 2.7-4.2 minutes) per test

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:58:12 +03:00
Marius
5750b42836 Oracle DR: Replace fixed VM boot wait with intelligent polling
Performance optimization for VM startup:

Before: Fixed 180s wait regardless of actual boot time
After: Intelligent polling with early exit when VM is ready

Implementation:
- Poll every 5 seconds (max 180s timeout)
- Check 1: VM running status in Proxmox (qm status)
- Check 2: SSH connectivity test
- Check 3: PowerShell availability (what we actually need)
- Exit immediately when all checks pass
- Progress logging every 30 seconds
- Fallback: Continue after 180s with warning
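
A minimal sketch of the polling loop described above. `wait_until` and `vm_ready` are hypothetical names, and the SSH target in the commented example is a placeholder. VM ID 109 is taken from other commits in this log.

```shell
# Hypothetical helper: run a check command every INTERVAL seconds until it
# succeeds or TIMEOUT seconds of waiting have accumulated.
# Returns 0 on success, 1 on timeout. Time spent inside the check itself is
# not counted, so the real wall-clock limit can be somewhat longer.
wait_until() {
    local timeout=$1 interval=$2 elapsed=0
    shift 2
    while [ "$elapsed" -lt "$timeout" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}

# Example wiring (not executed here):
# vm_ready() {
#     qm status 109 | grep -q running &&
#     ssh -o ConnectTimeout=5 user@dr-vm 'powershell -Command "exit 0"'
# }
# wait_until 180 5 vm_ready || echo "WARNING: VM not confirmed ready after 180s"
```

Chaining the three checks with `&&` means the loop exits as soon as the VM is running, SSH answers, and PowerShell responds, which matches the early-exit behaviour in the list.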

Benefits:
- Fast VM boot (30s) → saves 150s (2min 30s)
- Normal VM boot (60s) → saves 120s (2min)
- Slow VM boot → 180s (same as before)
- More robust: verifies SSH+PowerShell actually work

Average expected improvement: 60-120 seconds per test

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:30:32 +03:00
Marius
835d8b465b Oracle DR: Fix database verification, add bash log, and collect full RMAN log
Critical fixes and improvements:

1. Database verification fix (robust):
   - Use Select-String -Quiet to get True/False boolean
   - Convert PowerShell boolean to bash-friendly format
   - Check for 'READ WRITE' in entire sqlplus output
   - Eliminates false negatives from text parsing issues

2. Collect FULL RMAN restore log:
   - Removed -Head 200 limitation
   - Now sends complete RMAN log in email
   - Better debugging with full context
   - Updated templates: "first 200 lines" → "complete"

3. Add bash script log to email notifications:
   - Include last 100 lines of bash execution log
   - Separate "RMAN Restore Log" and "Bash Script Log" sections
   - Both text and HTML templates updated
   - Shows script flow and any bash-level errors

This fixes the issue where 42,625 tables were restored successfully
but test reported FAILED due to query output format mismatch.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 14:25:58 +03:00
Marius
12700261c7 Oracle DR: Fix database verification and restore log collection
Critical fixes for false negatives in DR test reporting:

1. Database verification fix:
   - Changed from 'findstr' (CMD) to 'Select-String' (PowerShell native)
   - findstr was failing in PowerShell context causing db_status to be empty
   - Result: DB with 42,625 tables was incorrectly reported as FAILED

2. Restore log collection fix:
   - Changed from 'type' (CMD) to 'Get-Content' (PowerShell native)
   - type command doesn't work through SSH PowerShell context
   - Added -ErrorAction SilentlyContinue for cleaner error handling
   - Simplified fallback logic using a [ -z ] emptiness test instead of string matching

Both issues were caused by mixing CMD commands in PowerShell context.
Now uses PowerShell-native commands throughout for consistency.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:11:37 +03:00
Marius
c4504cac70 Oracle DR: Increase recovery area size to 50G
Adjust db_recovery_file_dest_size in auto-generated PFILE:
- Previous: 20G
- New: 50G
- Reason: Provide more space for RMAN restore operations and backups

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:06:33 +03:00
Marius
eade344f28 Oracle DR: Auto-create PFILE if missing using tested configuration
Enhancement to rman_restore_from_zero.ps1:
- Auto-generate initROA.ora if not found at service creation
- Uses exact tested configuration from initROA.ora:
  - memory_target=1024M (tested DR VM allocation)
  - _allow_resetlogs_corruption=TRUE (critical for DR restore!)
  - control_files in oradata + recovery_area
  - Standard Oracle 19c parameters for DR environment

Benefits:
- Script is now fully self-sufficient
- No manual PFILE setup required
- DR VM can be restored from completely clean state
- Uses battle-tested configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 11:05:17 +03:00
Marius
5bed910b1c Oracle DR: Optimize test speed by preserving service between tests
Performance improvements:
- cleanup_database.ps1: Skip service deletion (saves ~25s per test)
  - Remove oradim -delete, sc.exe delete, registry cleanup
  - Add SPFILE deletion to ensure PFILE-based startup
  - Service now persists between tests for reuse

- rman_restore_from_zero.ps1: Smart service check (saves ~15s per test)
  - Check if service exists before creating
  - Skip oradim -new if service already present
  - Only create service on first run or if missing

Total time savings: ~40 seconds per weekly DR test
Service lifecycle: Created once, reused indefinitely until manual cleanup

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:59:43 +03:00
Marius
3a51880c9e Oracle DR: Fix RMAN crosscheck sequence and improve error handling
- Fix CROSSCHECK BACKUP command to execute after database is mounted
- Correct CATALOG command to use recovery_area instead of F:\ path
- Add robust backup file validation with detailed error reporting
- Improve file-by-file backup copying with individual error tracking
- Enhance restore log collection for both success and failure scenarios
- Fix database verification to check OPEN_MODE instead of STATUS
- Add comprehensive directory and permissions error handling

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 10:32:49 +03:00
Marius
9ed0ee9e0e Oracle DR: Add TestMode parameter for dual behavior
rman_restore_from_zero.ps1:
- Add -TestMode switch parameter
- TestMode (weekly DR test): Skip service/listener config, only verify restore works
- Standalone mode: Full config with SPFILE + Listener for production use

weekly-dr-test-proxmox.sh:
- Call restore script with -TestMode flag
- Avoids service recreation and SSH disconnect during tests

Benefits:
- Weekly tests are faster and cleaner (no service restart)
- Manual restore prepares system for production use
- No more 'Broken pipe' errors during tests

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:33:43 +03:00
Marius
f79331f7cc Oracle DR: Fix service recreation causing SSH disconnect
Remove service delete/recreate at step 3.3 that was causing 'Broken pipe' error
Service is already configured with auto-start at step 2.1 - no need to recreate

Issue: oradim -delete was killing running database and breaking SSH connection
Solution: Skip recreation, service already has correct auto-start configuration

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:29:38 +03:00
Marius
6accd1f996 Oracle DR: Fix verification commands and auto-start services
weekly-dr-test-proxmox.sh:
- Replace Unix commands (echo, grep) with PowerShell equivalents
- Use PowerShell Select-String for database status verification
- Fix table count query to work properly through SSH

rman_restore_from_zero.ps1:
- Set Oracle service to AUTOMATIC startup (was manual)
- Set Listener service to AUTOMATIC startup
- Auto-start Listener after database restore
- Add fallback to lsnrctl if service start fails

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-10-11 02:03:57 +03:00