clienti/ contained Oracle XE 21c troubleshooting and PDB recreation
scripts. Moving it under proxmox/lxc108-oracle/ keeps Oracle migration
material colocated with the Oracle host docs. Update the two relative
links in roa-windows-setup that pointed to the old location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move VM302-TESTING.md from lxc108-oracle/roa-windows-setup/test/ into a
new proxmox/vm302-oracle-test/ directory (sibling of vm109/vm201) so the
test environment is documented separately from the setup scripts. Add a
dual-edition test plan (XE validated / SE TODO) and a stub for capturing
the production SE errors next time they reproduce.
Cross-link from roa-windows-setup/README.md, proxmox/README.md master
index and CLAUDE.md entry points. Setup scripts stay in lxc108-oracle —
they are not VM-specific.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM 109 returned to its original home on pveelite, co-located with
oracle-backups NFS storage. The README is updated to reflect that:
the VM is now in HA (ha-prefer-pveelite, state=stopped, nofailback=1)
rather than excluded from HA, and the new layered defences (trap
guard, watchdog cron, dynamic memory pre-flight, max_restart caps)
are documented alongside the original 8a0c557 trap.
Adds a Storage Failover section describing the pveelite -> pvemini
manual failover flow: email alert from pveelite-down-alert.sh,
failover-dr-to-pvemini.sh on the surviving node, failback when
pveelite returns. The pve1 nightly mirror is the third copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
failover-dr-to-pvemini.sh and failback-dr-to-pveelite.sh promote/demote
the rpool/oracle-backups dataset between nodes when pveelite is down.
Both refuse to run if the other side is reachable to prevent split-brain.
Both patch transfer_backups.ps1 on Oracle Production (10.0.20.36) via
SSH to redirect the daily SCP target between 10.0.20.202 and 10.0.20.201.
The PowerShell patch uses -EncodedCommand (UTF-16LE base64) so the bash
caller does not need to escape PowerShell quoting. End-to-end test
including failover -> failback confirmed transfer_backups.ps1 returns
to byte-identical state (SHA256 43DD2187...).
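The encoding step can be sketched in a few lines of bash (the command text below is illustrative, not the real patch): PowerShell's -EncodedCommand expects the command as base64-encoded UTF-16LE, so no PowerShell quoting ever crosses the bash/SSH boundary.

```shell
# Build a -EncodedCommand payload. PS_CMD here is a stand-in command,
# not the actual transfer_backups.ps1 patch.
PS_CMD='Write-Output "SCP target redirected"'
ENCODED=$(printf '%s' "$PS_CMD" | iconv -f UTF-8 -t UTF-16LE | base64 -w0)
# On Oracle Production this would then run as (sketch):
#   ssh oracle-prod powershell.exe -NoProfile -EncodedCommand "$ENCODED"
```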
pveelite-down-alert.sh runs every minute on pvemini and emails an alert
with copy-paste failover instructions after 5 consecutive ping failures.
The alert body includes the latest oracle-backups and VM 109 replica
timestamps so the operator knows the recovery point before deciding.
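The consecutive-failure counter can be sketched as below (the state-file path and the commented-out mail invocation are assumptions, not the script's real contents):

```shell
# Count consecutive ping failures against pveelite; alert exactly once,
# on the 5th miss. STATE path is illustrative.
STATE="${TMPDIR:-/tmp}/pveelite-down.count"
if ping -c1 -W2 10.0.20.202 >/dev/null 2>&1; then
  rm -f "$STATE"                                  # reachable: reset the streak
else
  n=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  printf '%s\n' "$n" > "$STATE"
  if [ "$n" -eq 5 ]; then
    echo "pveelite unreachable for 5 checks - see failover runbook"  # | mail -s "pveelite DOWN" root
  fi
fi
```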
The DR weekly-test script gains a cluster-aware guard at the top that
exits silently when /etc/pve/qemu-server/109.conf is not on the local
node, allowing the same cron entry to be present on both pveelite and
pvemini without double-firing.
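The guard itself is a one-liner against pmxcfs, wrapped in a function here so it can be exercised against any config directory: /etc/pve/qemu-server is a per-node view, so 109.conf is only visible on the node that currently owns the VM.

```shell
# Return success only when the given VMID's config is visible locally.
owns_vm() {
  local confdir="$1" vmid="$2"
  [ -e "$confdir/qemu-server/$vmid.conf" ]
}
# Real guard at the top of the weekly-test script:
#   owns_vm /etc/pve 109 || exit 0
```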
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convert /mnt/pve/oracle-backups from a directory on the pveelite
rootfs into a dedicated ZFS dataset rpool/oracle-backups so it can be
incrementally replicated to pvemini. zfs-replicate-oracle-backups.sh
runs every 15 minutes from cron on pveelite and uses zfs send/recv
over the cluster's internal SSH (direct IP, /etc/pve/priv/known_hosts)
to avoid Tailscale MagicDNS detours that broke the first attempt.
The destination dataset is set readonly=on so accidental writes on
pvemini cannot diverge it. Snapshot pruning keeps 5 rolling copies.
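One replication cycle can be sketched as below. The dataset name, destination IP and known_hosts path come from this commit; the snapshot naming and the `run`/DRY dry-run hook are illustrative assumptions (DRY=1, the default, prints each step instead of executing it).

```shell
# DRY=1 (default) prints the commands; set DRY=0 on a real node.
run() { if [ "${DRY:-1}" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }
DS=rpool/oracle-backups
NEW="$DS@repl-$(date +%Y%m%d%H%M%S)"
run "zfs snapshot $NEW"
# Incremental stream since the last common snapshot, over the cluster's
# internal SSH (direct IP + /etc/pve/priv/known_hosts, no MagicDNS detour):
run "zfs send -i @repl-prev $NEW | ssh -o UserKnownHostsFile=/etc/pve/priv/known_hosts root@10.0.20.201 zfs recv -F $DS"
```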
nightly-backup-mirror.sh ships a third copy nightly to pve1's
backup-ssd (ext4 SATA) — different physical disk, different
filesystem, different node — guarding against the failure mode where
both pveelite and pvemini are simultaneously unavailable. The same
script tars /etc/pve and rotates 14 days of cluster config archives,
since pmxcfs is in-RAM and a multi-node quorum loss would otherwise
take cluster config with it.
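The config-archive step can be sketched as a tar-plus-rotate pair (the real destination is pve1's backup-ssd; DEST defaults to a temp path here so the sketch is harmless to run anywhere, and the exact path is an assumption):

```shell
# Nightly /etc/pve archive: pmxcfs is RAM-backed, so these tarballs are
# what survives a multi-node quorum loss.
DEST="${DEST:-${TMPDIR:-/tmp}/pve-config}"   # real script: backup-ssd path on pve1
mkdir -p "$DEST"
tar -czf "$DEST/etc-pve-$(date +%F).tar.gz" -C / etc/pve 2>/dev/null || true
# Rotate: keep 14 days of cluster config archives
find "$DEST" -name 'etc-pve-*.tar.gz' -mtime +14 -delete
```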
The old directory is kept as oracle-backups.old-DELETE-AFTER-2026-05-02
on pveelite for one week as a safety net.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DR test script now refuses to start VM 109 if:
* cluster is not quorate (e.g. mid-failover into a degraded state),
* available memory on the host is below VM 109 config + 1 GB margin.
Both checks scale automatically — memory threshold is computed from
qm config so resizing VM 109 does not require touching the script.
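Both checks condense to two shell functions. `corosync-quorumtool -s` and `qm config` are the real tools; the function names and the exact grep pattern are assumptions for illustration.

```shell
quorate() { corosync-quorumtool -s 2>/dev/null | grep -q 'Quorate:[[:space:]]*Yes'; }
mem_ok() {  # $1: VM memory in MB, e.g. $(qm config 109 | awk '/^memory:/{print $2}')
  local need=$(( $1 + 1024 ))   # configured memory + 1 GB margin
  local avail
  avail=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
  [ "$avail" -ge "$need" ]
}
# Guard in the DR script (sketch):
#   quorate || exit 1
#   mem_ok "$(qm config 109 | awk '/^memory:/{print $2}')" || exit 1
```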
Adds vm109-watchdog.sh, scheduled cluster-wide every minute. The
watchdog is the second line of defence behind the cleanup trap from
8a0c557: it force-stops VM 109 if the trap was bypassed (script
killed, host crash mid-test, manual run forgotten). It honours
/var/run/vm109-debug.flag for legitimate manual sessions and is
node-aware via /etc/pve/qemu-server/109.conf so it can be deployed
on every node without coordinating with VM 109's current location.
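The watchdog's decision logic, as described, reduces to three checks (the flag and config paths come from this commit; factoring them into a function is my own sketch):

```shell
should_stop_vm109() {
  [ -e /var/run/vm109-debug.flag ] && return 1      # operator debug session
  [ -e /etc/pve/qemu-server/109.conf ] || return 1  # VM owned by another node
  qm status 109 2>/dev/null | grep -q 'status: running'
}
# Cron, every minute on every node:
#   should_stop_vm109 && qm stop 109
```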
Both safeguards target the 04-18 → 04-20 failure chain: VM 109 was left
running for 2.5 days, then an HA failover pushed CT 108 Oracle (8 GB)
onto pveelite (16 GB), triggering an OOM cascade.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cleanup trap added in 8a0c557 stopped VM 109 unconditionally on EXIT,
which kills the VM during --install/--help or when an operator launched
it manually for debugging. Gate the trap with DR_VM_STARTED_BY_US so it
only fires when the script itself started the VM.
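The gating pattern, in outline (the variable name comes from this commit; the surrounding script is elided):

```shell
DR_VM_STARTED_BY_US=0
cleanup() {
  if [ "$DR_VM_STARTED_BY_US" = 1 ]; then
    qm stop 109 || true   # only stop what we ourselves started
  fi
}
trap cleanup EXIT
# ...later, only on the actual test path (never --install/--help):
#   qm start 109 && DR_VM_STARTED_BY_US=1
```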
Also remove the 2>/dev/null swallow on qm start so cross-node failures
(e.g. running on a node where the VM is not configured) appear in the
log instead of producing a silent "Failed to start VM 109" in 0 seconds.
Root cause for the 2026-04-25 silent failure: cron lived on pveelite
while VM 109 had been migrated to pvemini; qm start returned an error
that was hidden by the redirect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM 201 (Windows critical) stays out of HA by design. Added:
- failover-vm201.sh: interactive failover pvemini -> pveelite with ZFS replication state
- recover-vm201-to-pvemini.sh: interactive reverse migration with uptime + split-brain checks
- pvemini-down-alert.sh: cron watchdog on pveelite, emails full runbook after 2min DOWN
Replication RPO tightened: CT 108 + VM 201 to 5min, CT 171 to 15min.
CT 171 added to HA (ha-group-main) for continuous Claude Code access.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Following the 2026-04-20 cluster outage, the cluster README now covers
HA resource limits, corosync token tuning (10s tolerance for USB glitches),
rasdaemon/netconsole/kdump diagnostic stack on pvemini, mail relay via
mail.romfast.ro with SMTP auth, OOM alerting via cron, and swap on pveelite.
VM 109 README now clearly states it was removed from HA and is only
started by the weekly DR test script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DR test script used set -euo pipefail, so a failing SSH
shutdown command caused the script to exit before qm stop.
On 2026-04-20 this left VM 109 running for 2.5 days and
triggered an OOM cascade when pvemini HA-failed over to
pveelite.
Adds EXIT trap that force-stops VM 109 regardless of exit
path, and makes the Step 7 SSH shutdown tolerant of failure.
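The combined pattern looks roughly like this (the guest hostname is illustrative): the EXIT trap guarantees `qm stop`, and the Step 7 guest shutdown is made failure-tolerant so `set -euo pipefail` can no longer abort the script before the stop.

```shell
set -euo pipefail
trap 'qm stop 109 || true' EXIT
# Step 7: ask the guest to shut down, but tolerate failure --
# the trap will force-stop the VM either way.
ssh -o BatchMode=yes -o ConnectTimeout=5 Administrator@vm109-dr-guest 'shutdown /s /t 0' \
  || echo "guest shutdown failed; EXIT trap will qm stop 109 anyway"
```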
Incident details: proxmox/cluster/incidents/2026-04-20-cluster-outage.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ExecuteScriptOS.prc: runs PowerShell scripts via DBMS_SCHEDULER
- UpdateSQLPLUS.prc: runs SQL*Plus scripts via DBMS_SCHEDULER
- find_oracle_locations.sql: comprehensive script to discover all Oracle DB paths for backup/migration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add section explaining the root cause (IPVS broken in LXC), the
solution (dnsrr endpoint mode), and the dokploy-dnsrr-fix systemd
service that auto-applies the fix on every Dokploy deployment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Changes after the MoltBot → OpenClaw upgrade:
- RAM increased from 2GB to 4GB (recommended minimum for OpenClaw)
- Version updated: OpenClaw v2026.2.9 (formerly MoltBot v2026.1.24-3)
- Added troubleshooting for OOM kill issues
- Cleaned up old sessions (85 → 80)
Problem solved: the gateway was being killed by the OOM killer due to
insufficient memory (975MB peak with only 2GB total RAM).
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Create proxmox/lxc110-moltbot/ with complete README documentation
- MoltBot AI chatbot with Telegram and WhatsApp channels
- Claude Opus 4.5 model integration via Anthropic API
- Security: dedicated moltbot user, UFW firewall, fail2ban, Tailscale SSH
- Gateway on port 18789 (loopback), token+password auth
- Update proxmox/README.md with LXC 110 quick start and navigation
- Update CLAUDE.md network layout with MoltBot entry
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New-OracleDirectory: Improved verification with direct SQL check, preserves
existing DMPDIR path instead of blindly recreating
- Get-DatafilePath: Better fallback logic using ORACLE_HOME to derive path,
no longer hardcodes C:\app\oracle
- grants-public.sql: Fixed DMPDIR creation - now preserves existing path
instead of overriding with wrong D:\Oracle\admin\ORCL\dpdump
- config.example.ps1: Added DATAFILE_DIR parameter with documentation
These fixes ensure scripts work without manual intervention on fresh Oracle XE
installations where default DMPDIR points to non-existent paths.
Tested on VM 302 - full installation (01-08) now completes successfully.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- New-OracleDirectory now checks if DMPDIR exists with wrong path
- If path differs from target, drops and recreates the directory
- Fixes Oracle XE issue where DMPDIR defaults to D:\Oracle\admin\ORCL\dpdump
- Added VM302-TESTING.md with complete testing workflow documentation
- Includes Proxmox VM management commands, troubleshooting, and deployment steps
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The script had inline SQL that was missing 20 synonyms compared to
synonyms-public.sql, causing PACK_DEF and other packages to fail with
missing synonym errors (SYN_VNOM_UM_ISO, SYN_ATAS_*, SYN_SAL_*, etc.).
Changes:
- Remove all inline SQL (~350 lines)
- Now runs synonyms-public.sql (81 synonyms vs 61 before)
- Now runs grants-public.sql for all grants and ACL
- Add verification of SESIUNE context
This ensures the script stays in sync with the SQL files and
prevents future desync issues.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key fixes:
- Add Run.cmd/RunAll.cmd wrappers with ExecutionPolicy Bypass
- Add Get-ListenerHost() to auto-detect listener IP address
- Fix impdp connection using EZConnect format (host:port/service)
- Add parallel=1 for Oracle XE compatibility
- Fix Write-Log to accept empty strings with [AllowEmptyString()]
- Fix Get-SchemaObjectCount regex for Windows line endings (\r\n)
- Fix path comparison for DMP file copy operation
- Add GRANT EXECUTE ON SYS.AUTH_PACK TO PUBLIC for PACK_DREPTURI
- Fix VAUTH_SERII view to use SYN_NOM_PROGRAME (has DENUMIRE column)
- Add sections 10-11 to grants-public.sql for SYS object grants
Tested on VM 302 (10.0.20.130) with Oracle XE 21c.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase wait time from 10s to max 60s after listener restart
- Add active polling every 5s to check if service is registered
- Log progress while waiting for service registration
- Fixes race condition where script proceeds before service is ready
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- run-all-sys.sql: Master script that executes sys-objects.sql,
sys-grants.sql, and any scripts in sys-updates/ folder in order
- sys-grants.sql: Grants EXECUTE on AUTH_PACK, DBMS_SCHEDULER,
DBMS_LOCK, UTL_* packages to CONTAFIN_ORACLE; creates public
synonyms for SYS procedures; creates DMPDIR directory
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename $Host parameter to $DbHost in oracle-functions.ps1 (Invoke-SqlPlus,
Test-OracleConnection, Get-OracleVersion, Test-PDB, Get-ServiceName)
- Update all function calls in 01-setup-database.ps1 to use -DbHost instead of -Host
- Fix ${Host} -> ${DbHost} in log message (line 147)
- Fix Write-Log "" -> Write-Host "" to avoid empty string parameter error
- Add DbHost/Port parameters and config.ps1 support to setup script
- Update sys-updates/README.md to clarify folder is for future patches only
Tested successfully on ROACENTRAL (10.0.20.130) with Oracle XE 21c.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Added roa_kill_user_sessions helper procedure
- Kill all active sessions BEFORE attempting DROP USER
- Improved company user detection (also checks for synonyms to CONTAFIN_ORACLE)
- Added more Oracle 21c internal users to exclusion list
- Better error handling and output messages
- Helper procedure auto-cleanup at end
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Original Oracle 10g R1/R2 setup scripts and SQL migrations from 2007-2026.
Preserved as reference for understanding ROA database structure and
historical schema evolution.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PowerShell scripts for setting up Oracle 21c/XE with ROA application:
- Automated tablespace, user creation and imports
- sqlnet.ora config for Instant Client 11g/ODBC compatibility
- Oracle 21c read-only Home path handling (homes/OraDB21Home1)
- Listener restart + 10G password verifier for legacy auth
- Tested on VM 302 with CONTAFIN_ORACLE schema import
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add config/sqlnet.ora with ALLOWED_LOGON_VERSION=8 for old client support
- Add scripts/fix-sqlnet.sh startup script to persist config across container restarts
- Update README with ORA-28040 troubleshooting, ODBC connection params, and deployment instructions
- Fix SID description: Oracle 18c has PDB (XEPDB1), not non-CDB
- Update container recreation instructions with startup scripts volume
Resolves ORA-28040: No matching authentication protocol when connecting
from Windows ODBC with Oracle Instant Client 11.2
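The fix pivots on a single sqlnet.ora line; a minimal server-side fragment as described (note: from 12c onward the documented spelling is SQLNET.ALLOWED_LOGON_VERSION_SERVER, and affected passwords must be re-set afterwards so a 10G verifier exists — this sketch assumes that variant):

```
# sqlnet.ora on the Oracle 18c container
SQLNET.ALLOWED_LOGON_VERSION_SERVER=8
```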
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Rename proxmox/claude-agent/ to proxmox/lxc171-claude-agent/
- Move scripts to scripts/ subdirectory
- Add complete installation guide for new LXC from scratch
- Update proxmox/README.md with LXC 171 documentation and navigation
- Add LXC 171 to containers table
- Remove .serena/project.yml
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move oracle/migration-scripts/ to proxmox/lxc108-oracle/migration/
- Move oracle/roa/ and oracle/roa-romconstruct/ to proxmox/lxc108-oracle/sql/
- Move oracle/standby-server-scripts/ to proxmox/vm109-windows-dr/
- Move chatbot/ to proxmox/lxc104-flowise/
- Update proxmox/README.md with new structure and navigation
- Update all documentation with correct directory references
- Remove unused input/claude-agent-sdk/ files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create cluster/ for Proxmox cluster infrastructure (SSH guide, HA monitor, UPS)
- Create lxc108-oracle/ for Oracle Database documentation and scripts
- Create vm201-windows/ for Windows 11 VM docs and SSL certificate scripts
- Add SSL certificate monitoring scripts (check-ssl-certificates.ps1, monitor-ssl-certificates.sh)
- Remove archived VM107 references (decommissioned)
- Update all cross-references between files
- Update main README.md with new structure and navigation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix permission denied on log files (chown nut:nut)
- Fix upssched.conf permissions (root:nut)
- Add sudo for perl to allow PVE::Notify from user nut
- Add periodic battery status emails every minute when on battery
- Add charging status emails at 5, 10, 30 min after power restore
- Remove diacritics from all notification messages
- Update documentation with sudo and permissions setup
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add email notifications via PVE::Notify for all UPS events:
- ONBATT: when UPS switches to battery
- ONLINE: when power is restored
- LOWBATT: critical battery level
- SHUTDOWN_START/NODE/PRIMARY: during cluster shutdown
- COMMBAD: communication lost with UPS
- Add automatic UPS shutdown command after cluster shutdown
(protects against power surge when power returns)
- Update upssched.conf with ONLINE handler and immediate ONBATT notification
- Add notification templates for HTML and text emails
- Update documentation with new features and timer configuration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Storage Configuration improvements:
- Add "Noduri" ("Nodes") column showing which nodes have access to each storage
- Clarify that 'local' is separate on each node (non-shared)
- Clarify that 'local-zfs' is shared across pvemini, pve1, pveelite
- Clarify that 'backup' is only on pvemini (10.0.20.201)
- Add detailed explanations for each storage type
- Add storage paths section with important locations
Node name corrections:
- Fix node name: pve2 → pveelite (correct cluster name)
- Update all references across proxmox-ssh-guide.md and README.md
- Add node descriptions in tables for clarity
Benefits:
- Users now know exactly which storage is available on which nodes
- Clear distinction between shared and non-shared storage
- Correct node naming throughout documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move oracle/CONEXIUNI-ORACLE.md → proxmox/oracle-database-lxc108.md
- Create proxmox/README.md as documentation index
- Update proxmox-ssh-guide.md:
* Remove VM 107 references (decommissioned)
* Update LXC and VM tables with IP addresses
* Add IP address map for all services
* Simplify Oracle section (detailed info in oracle-database-lxc108.md)
* Update backup job configuration
Benefits:
- All infrastructure docs in proxmox/ directory
- Clear separation: general Proxmox (proxmox-ssh-guide.md) vs Oracle-specific (oracle-database-lxc108.md)
- No duplicate information between files
- Easy navigation with README.md index
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Fixed two critical issues with HA monitoring:
1. False positive quorum errors - corosync-quorumtool not in cron PATH
2. Unwanted cron emails from PVE::Notify INFO messages to STDERR
Changes:
- Set proper PATH including /usr/sbin for corosync-quorumtool
- Split notification code: verbose shows all, non-verbose redirects STDERR to /dev/null
- Prevents cron from sending duplicate notification emails
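Both fixes sketched together (the `send_notification` stub stands in for the real PVE::Notify call — an assumption for illustration):

```shell
# Cron's default PATH omits /usr/sbin, where corosync-quorumtool lives,
# which produced the false quorum errors.
export PATH=/usr/sbin:/usr/bin:/sbin:/bin
send_notification() { echo "INFO: notification sent" >&2; }   # stand-in
if [ "${VERBOSE:-0}" = 1 ]; then
  send_notification                 # interactive run: show everything
else
  send_notification 2>/dev/null     # cron run: STDERR would become an email
fi
```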
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add comprehensive Oracle backup and DR strategy documentation
- Add RMAN backup scripts (full and incremental)
- Add PowerShell transfer scripts for DR site
- Add bash restore and verification scripts
- Reorganize Oracle documentation structure
- Add Proxmox troubleshooting guide for VM 201 HA errors and NFS storage issues
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The ups-shutdown-cluster.sh script was missing LXC container shutdown
functionality, only shutting down VMs. This could leave containers
running during UPS power failure, causing ungraceful shutdown.
Changes:
- Added Step 2: LXC container shutdown on all cluster nodes
- Uses 'pct list' to find running containers
- Shuts down each container with 60s timeout
- Parallel shutdown with '&' for speed
- Both local (pvemini) and remote nodes (pve1, pveelite)
- Updated step numbers (now 6 steps total vs 5 before)
- Fixed log_message() to use dynamic timestamp
- Fixed node name comment (pve2 → pveelite)
Shutdown order:
1. VMs on all nodes (timeout 60s)
2. LXC containers on all nodes (timeout 60s) [NEW]
3. Wait 90 seconds for graceful shutdown
4. Secondary nodes shutdown (pve1, pveelite)
5. Wait 30 seconds
6. Primary node shutdown (pvemini)
This matches the behavior in ups-maintenance-shutdown.sh which already
had LXC support.
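Step 2's core loop, as described (the `pct list` column layout — header row, VMID first, status second — is standard Proxmox output; this is a sketch of the local-node case only):

```shell
# List running container IDs, then shut each down in parallel.
pct_running() { pct list 2>/dev/null | awk 'NR>1 && $2=="running" {print $1}'; }
for ctid in $(pct_running); do
  pct shutdown "$ctid" --timeout 60 &   # parallel shutdown, as in the script
done
wait
```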
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Adds complete procedure for replacing UPS battery when entire cluster
is powered by the same UPS, requiring full cluster shutdown.
New files:
- scripts/ups-maintenance-shutdown.sh: Automated orchestrated shutdown
for maintenance operations with confirmation prompts and progress display
- docs/UPS-BATTERY-REPLACEMENT.md: Complete step-by-step guide for battery
replacement including pre-shutdown, physical replacement, and post-startup
verification procedures
Features:
- Orchestrated shutdown: VMs → LXC containers → secondary nodes → primary
- Interactive confirmation before shutdown
- Color-coded progress indicators
- Countdown timers for each phase
- Post-replacement verification checklist
- Troubleshooting guide for common issues
- Recovery procedures for cluster/quorum problems
The procedure accounts for all 3 cluster nodes (pve1, pvemini, pveelite)
being on the same UPS, requiring complete infrastructure shutdown.
Documentation includes:
- When to replace battery (based on monthly test results)
- Pre-planning and user notification templates
- Physical battery replacement safety procedures
- Cluster recovery and VM restart procedures
- Post-replacement testing and verification
- 24-hour and 1-week monitoring checklists
Estimated maintenance window: 30-60 minutes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds a comprehensive UPS monitoring and management system for
the Proxmox cluster with automated shutdown orchestration and monthly
battery health testing.
Features:
- NUT (Network UPS Tools) configuration for INNO TECH USB UPS
- Automated cluster shutdown on power failure (3-minute grace period)
- Monthly automated battery testing with health evaluation
- Email notifications via PVE::Notify system
- WinNUT monitoring client for Windows VM 201
Components added:
- config/: NUT configuration files (ups.conf, upsd.conf, upsmon.conf, etc.)
- scripts/ups-shutdown-cluster.sh: Orchestrated cluster shutdown
- scripts/ups-monthly-test.sh: Monthly battery test with email reports
- scripts/upssched-cmd: Event handler for UPS state changes
- docs/: Complete installation and usage documentation
Key findings:
- UPS battery.charge reporting has 10-40 second delay after test start
- Test must monitor voltage drop (1.5-2V) and charge drop (9-27%)
- Battery health evaluation: EXCELLENT/GOOD/FAIR/POOR based on discharge rate
- Email notifications use Handlebars templates without Unicode emojis for compatibility
Configuration:
- UPS: INNO TECH (Voltronic protocol, vendor 0665:5161)
- Primary node: pvemini (10.0.20.201) with USB connection
- Monthly test: cron 0 0 1 * * /opt/scripts/ups-monthly-test.sh
- Shutdown timer: 180 seconds on battery before cluster shutdown
Documentation includes complete installation guides for NUT server,
WinNUT client, and troubleshooting procedures.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>