# VM 109 - Oracle DR System (Windows Standby)

**Proxmox directory:** `proxmox/vm109-windows-dr/`
**VMID:** 109
**Role:** Disaster recovery for Oracle Database (RMAN backups from an external Windows server)

## ⚠️ Important — Topology after 2026-04-25

VM 109 lives on **pveelite** (10.0.20.202), co-located with the NFS storage for Oracle backups. Post-incident (04-20) configuration:

- **VM 109 in HA, group `ha-prefer-pveelite`** (pveelite=100, pvemini=50, pve1=10), `state=stopped`, `nofailback=1` — HA fails the VM over if pveelite dies but does not restart VM 109 automatically (it stays stopped; the DR script starts it weekly).
- **Defenses against a repeat of incident 04-20:**
  - `trap cleanup_vm EXIT` in the DR script (commit 8a0c557) with a `DR_VM_STARTED_BY_US` guard (commit 2e8cd9c) — stops VM 109 only if the script started it.
  - `vm109-watchdog.sh` cron on both pveelite and pvemini (cluster-aware) — force-stops VM 109 if it runs > 60 min outside the test window (Saturday 05:55-07:30). Debug exemption: `touch /var/run/vm109-debug.flag`.
  - Pre-flight check in the DR script: refuses `qm start 109` if the cluster is degraded or available memory < (VM 109 memory + 1 GB margin).
  - `max_restart=3, max_relocate=2` on all HA services — caps restart loops on OOM.

**Status check:**

```bash
ssh root@10.0.20.201 "ha-manager status | grep -E '109|201|108'"
ssh root@10.0.20.202 "qm status 109"   # must be stopped between tests
```

## 🔄 Storage Failover (pveelite → pvemini)

`/mnt/pve/oracle-backups` is a ZFS dataset replicated pveelite → pvemini every 15 min (`zfs-replicate-oracle-backups.sh`), plus a nightly mirror to pve1 backup-ssd (`nightly-backup-mirror.sh`). When pveelite is down:

1. **Automatic email** from `pveelite-down-alert.sh` (cron on pvemini, 5-minute threshold) with copy-paste failover instructions.
2. The operator runs on pvemini: `/opt/scripts/failover-dr-to-pvemini.sh` — promotes the ZFS replica (readonly → off), configures the NFS export, and patches the primary's Oracle scheduled-task IP via SSH.
3. When pveelite comes back: `/opt/scripts/failback-dr-to-pveelite.sh` — the reverse, with an incremental zfs send and config restore. Both scripts refuse to run if the other side is reachable (anti-split-brain).

---

# 🛡️ Oracle DR System - Complete Architecture

## 📊 System Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     PRODUCTION ENVIRONMENT                      │
├─────────────────────────────────────────────────────────────────┤
│  PRIMARY SERVER (10.0.20.36)                                    │
│  Windows Server + Oracle 19c                                    │
│  ┌──────────────────────────────┐                               │
│  │ Database: ROA                │                               │
│  │ Size: ~80 GB                 │                               │
│  │ Tables: 42,625               │                               │
│  └──────────────────────────────┘                               │
│              │                                                  │
│              ▼ Backups (Daily)                                  │
│  ┌──────────────────────────────┐                               │
│  │ 02:30 - FULL backup (6-7 GB) │                               │
│  │ 13:00 - CUMULATIVE (200 MB)  │                               │
│  │ 18:00 - CUMULATIVE (300 MB)  │                               │
│  └──────────────────────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
                     │
                     │ SSH Transfer (Port 22)
                     ▼
┌─────────────────────────────────────────────────────────────────┐
│                        DR ENVIRONMENT                           │
├─────────────────────────────────────────────────────────────────┤
│  PROXMOX HOST (10.0.20.202 - pveelite)                          │
│  ┌──────────────────────────────┐                               │
│  │ Backup Storage (NFS Server)  │◄─────── Monitoring Scripts    │
│  │ /mnt/pve/oracle-backups/     │         /opt/scripts/         │
│  │ └── ROA/autobackup/          │                               │
│  └──────────────────────────────┘                               │
│              │                                                  │
│              │ NFS Mount (F:\)                                  │
│              ▼                                                  │
│  ┌───────────────────────────────┐                              │
│  │ DR VM 109 (10.0.20.37)        │                              │
│  │ Windows Server + Oracle 19c   │                              │
│  │ Status: OFF (normally)        │                              │
│  │ Starts for: Tests or Disaster │                              │
│  └───────────────────────────────┘                              │
└─────────────────────────────────────────────────────────────────┘
```

## 🎯 Quick Actions

### ⚡ Emergency DR Activation (Production Down!)

```bash
# 1. Start DR VM
ssh root@10.0.20.202 "qm start 109"

# 2. Connect to VM (wait 3 min for boot)
ssh -p 22122 romfast@10.0.20.37

# 3. Run restore (takes ~10-15 minutes)
D:\oracle\scripts\rman_restore_from_zero.cmd
# 4. Database is now RUNNING - Update app connections to 10.0.20.37
```

### 🔄 Failback DR → PRIMARY (when production is repaired)

The reverse procedure, for when the production server has been repaired or reinstalled and production must move back to `10.0.20.36`:

> ⚠️ **On PRIMARY, install Oracle 19c (NOT 21c) for an urgent failback.** The backups are 19.3. 21c can technically restore them, but it requires an additional dictionary upgrade (~30-60 min extra) — needless risk in a crisis window. Migration to 21c happens separately, after failback. Details in `FAILBACK_PROCEDURE.md`.

➡️ See **[docs/FAILBACK_PROCEDURE.md](docs/FAILBACK_PROCEDURE.md)** — end-to-end steps:

- Final backup on DR (with the DB in read-only / restricted mode)
- Restore onto the new PRIMARY with `scripts/rman_restore_to_primary.ps1`
- Switch connection strings + re-enable the RMAN scheduled tasks
- Stop VM 109, return to the normal state

### 🧪 Weekly Test (Every Saturday)

```bash
# Automatic at 06:00 via cron, or manual:
ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh"

# What it does:
# ✓ Starts VM → Restores DB → Tests → Cleanup → Shutdown
# ✓ Sends email report with results
```

### 📊 Check Backup Health

```bash
# Manual check (runs daily at 09:00 automatically)
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"

# Output:
#   Status: OK
#   FULL backup age: 11 hours ✓
#   CUMULATIVE backup age: 2 hours ✓
#   Disk usage: 45% ✓
```

## 🗂️ Component Locations

### 📁 PRIMARY Server (10.0.20.36)

```
D:\rman_backup\
├── rman_backup_full.txt           # RMAN script for FULL backup
├── rman_backup_incremental.txt    # RMAN script for CUMULATIVE
└── transfer_backups.ps1           # UNIFIED: Transfer ALL backups to Proxmox

Scheduled Tasks:
├── 02:30 - Oracle RMAN Full Backup
├── 03:00 - Transfer backups to DR (transfer_backups.ps1)
├── 13:00 - Oracle RMAN Cumulative Backup
├── 14:45 - Transfer backups to DR (transfer_backups.ps1)
└── 18:00 - Oracle RMAN Cumulative Backup
```

### 📁 PROXMOX Host (10.0.20.202)

```
/opt/scripts/
├── oracle-backup-monitor-proxmox.sh    # Daily backup monitoring
├── weekly-dr-test-proxmox.sh           # Weekly DR test
└── PROXMOX_NOTIFICATIONS_README.md     # Documentation

/mnt/pve/oracle-backups/ROA/autobackup/
├── FULL_20251010_023001.BKP    # Latest FULL backup
├── INCR_20251010_130001.BKP    # CUMULATIVE 13:00
└── INCR_20251010_180001.BKP    # CUMULATIVE 18:00

Cron Jobs:
0 9 * * * /opt/scripts/oracle-backup-monitor-proxmox.sh
0 6 * * 6 /opt/scripts/weekly-dr-test-proxmox.sh
```

### 📁 DR VM 109 (10.0.20.37) - When Running

```
D:\oracle\scripts\
├── rman_restore_from_zero.cmd    # Main restore script ⭐
├── cleanup_database.cmd          # Cleanup after test
└── mount-nfs.bat                 # Mount F:\ at startup

F:\ (NFS mount from Proxmox)
└── ROA\autobackup\               # All backup files
```

## 🔄 How It Works

### Backup Flow (Daily)

```
PRIMARY                                PROXMOX
   │                                      │
   ├─02:30─FULL─Backup─────────────►      │
   │        (6-7 GB)                      │
   ├─03:00─Transfer ALL────────────►  Skip duplicates
   │        (transfer_backups.ps1)        │
   │                                      │
   ├─13:00─CUMULATIVE──────────────►      │
   │        (200 MB)                      │
   ├─14:45─Transfer ALL────────────►  Skip duplicates
   │        (transfer_backups.ps1)        │ (only new files)
   │                                      │
   └─18:00─CUMULATIVE──────────────►   Storage
            (300 MB)                      │
                                   ┌──────────┐
                                   │ Monitor  │ 09:00 Daily
                                   │ Check Age│ Alert if old
                                   └──────────┘
```

### Restore Process

```
Start VM → Mount F:\ → Copy Backups → RMAN Restore → Database OPEN
  2min       Auto         2min           8min           Ready!

Total Time: ~15 minutes
```

## 🔧 Manual Operations

### Test Individual Components

```bash
# 1. Test backup transfer (on PRIMARY)
powershell -ExecutionPolicy Bypass -File "D:\rman_backup\transfer_backups.ps1"

# 2. Test NFS mount (on VM 109)
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:
dir F:\ROA\autobackup

# 3. Test notification system
ssh root@10.0.20.202 "touch -d '2 days ago' /mnt/pve/oracle-backups/ROA/autobackup/*FULL*.BKP"
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh"
# Should send WARNING notification
# 4. Test database restore (on VM 109)
D:\oracle\scripts\rman_restore_from_zero.cmd
```

### Force Actions

```bash
# Force backup now (on PRIMARY)
rman cmdfile=D:\rman_backup\rman_backup_incremental.txt

# Force cleanup VM (on VM 109)
D:\oracle\scripts\cleanup_database.cmd

# Force VM shutdown
ssh root@10.0.20.202 "qm stop 109"
```

## 🐛 Troubleshooting

### 🔍 Debugging Restore Tests

#### Check Backup Files on Proxmox (10.0.20.202)

```bash
# 1. List all backup files with size and date
ssh root@10.0.20.202 "ls -lht /mnt/pve/oracle-backups/ROA/autobackup/*.BKP"

# 2. Count backup files
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/*.BKP | wc -l"

# 3. Check latest backups (last 24 hours)
ssh root@10.0.20.202 "find /mnt/pve/oracle-backups/ROA/autobackup -name '*.BKP' -mtime -1 -ls"

# 4. Show backup files grouped by type (with new naming convention)
ssh root@10.0.20.202 "ls -lh /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '(L0_|L1_|ARC_|SPFILE_|CF_|O1_MF)'"

# 5. Check disk space usage
ssh root@10.0.20.202 "df -h /mnt/pve/oracle-backups"
ssh root@10.0.20.202 "du -sh /mnt/pve/oracle-backups/ROA/autobackup/"

# 6. Verify newest backup timestamp
ssh root@10.0.20.202 "stat /mnt/pve/oracle-backups/ROA/autobackup/L0_*.BKP 2>/dev/null | grep Modify || echo 'No L0 backups with new naming'"
```

#### Verify Backup Files on DR VM (when running)

```powershell
# 1. Check NFS mount is accessible
Test-Path F:\ROA\autobackup

# 2. List all backup files
Get-ChildItem F:\ROA\autobackup\*.BKP | Format-Table Name, Length, LastWriteTime

# 3. Count backup files
(Get-ChildItem F:\ROA\autobackup\*.BKP).Count

# 4. Show total backup size
"{0:N2} GB" -f ((Get-ChildItem F:\ROA\autobackup\*.BKP | Measure-Object -Property Length -Sum).Sum / 1GB)

# 5. Check latest Level 0 backup
Get-ChildItem F:\ROA\autobackup\L0_*.BKP -ErrorAction SilentlyContinue | Sort-Object LastWriteTime -Descending | Select-Object -First 1
# 6. Check what was copied during last restore
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "Copying|Copied"
```

#### Check DR Test Results

```bash
# 1. View latest DR test log
ssh root@10.0.20.202 "ls -lt /var/log/oracle-dr/dr_test_*.log | head -1 | awk '{print \$9}' | xargs cat | tail -100"

# 2. Check test status (passed/failed)
ssh root@10.0.20.202 "grep -E 'PASSED|FAILED|Database Verification' /var/log/oracle-dr/dr_test_*.log | tail -5"

# 3. See backup selection logic output
ssh root@10.0.20.202 "grep -A5 'TEST MODE: Selecting' /var/log/oracle-dr/dr_test_*.log | tail -20"

# 4. Check how many files were selected
ssh root@10.0.20.202 "grep 'Total files selected' /var/log/oracle-dr/dr_test_*.log | tail -1"

# 5. View RMAN errors (if any)
ssh root@10.0.20.202 "grep -i 'RMAN-\|ORA-' /var/log/oracle-dr/dr_test_*.log | tail -20"
```

#### Simulate Test Locally (on DR VM)

```powershell
# 1. Start Oracle service manually
Start-Service OracleServiceROA

# 2. Run cleanup to prepare for restore
D:\oracle\scripts\cleanup_database.ps1 /SILENT

# 3. Run restore in test mode
D:\oracle\scripts\rman_restore_from_zero.ps1 -TestMode

# 4. Verify database opened correctly
sqlplus / as sysdba @D:\oracle\scripts\verify_db.sql

# 5. Check what backups were used
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String "backup piece"
# 6. View database verification output
Get-Content D:\oracle\logs\restore_from_zero.log | Select-String -Pattern "DB_NAME|OPEN_MODE|TABLES" -Context 0,1
```

#### Common Restore Test Issues

| Issue | Check | Fix |
|-------|-------|-----|
| Test reports FAILED but DB is open | Check log for "OPEN_MODE: READ WRITE" | Already fixed in latest version |
| Missing datafiles in restore | Count backup files: should be 15-40+ | Wait for next full backup or copy all files |
| "No backups found" error | Verify NFS mount: `Test-Path F:\` | Remount NFS or check Proxmox NFS service |
| Restore takes > 30 min | Check backup size: should be ~5-8 GB | Normal for first restore after format change |
| RMAN-06023 errors | Check for L0_*.BKP files on F:\ | Old format: need new backup with naming convention |

#### Verify Naming Convention is Active

```bash
# Check if new naming convention is being used (after Oct 11, 2025)
ssh root@10.0.20.202 "ls /mnt/pve/oracle-backups/ROA/autobackup/ | grep -E '^(L0_|L1_|ARC_|SPFILE_|CF_)' | wc -l"
# Should return > 0 if active

# If 0, backups are still using old format (O1_MF_ANNNN_*)
# Wait for next scheduled backup (02:30 daily) or run manual backup
```

#### Manual Test Run with Verbose Output

```bash
# Run test with full output visible
ssh root@10.0.20.202
cd /opt/scripts
./weekly-dr-test-proxmox.sh 2>&1 | tee /tmp/dr_test_manual.log

# Watch in real-time what's happening
# Look for these key stages:
# - "TEST MODE: Selecting latest backup set"
# - "Total files selected: XX"
# - "RMAN restore completed successfully"
# - "OPEN_MODE: READ WRITE"
```

### ❌ Backup Monitor Not Sending Alerts

```bash
# 1. Check templates exist
ssh root@10.0.20.202 "ls /usr/share/pve-manager/templates/default/oracle-*"

# 2. Reinstall templates
ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh --install"
# 3. Check Proxmox notifications work
ssh root@10.0.20.202 "pvesh create /nodes/\$(hostname)/apt/update"
# Should receive update notification
```

### ❌ F:\ Drive Not Accessible in VM

```powershell
# On VM 109:
# 1. Check NFS Client service
Get-Service | Where {$_.Name -like "*NFS*"}

# 2. Manual mount
mount -o rw,nolock,mtype=hard,timeout=60 10.0.20.202:/mnt/pve/oracle-backups F:

# 3. Check Proxmox NFS server (from a Linux host)
ssh root@10.0.20.202 "showmount -e localhost"
# Should show: /mnt/pve/oracle-backups 10.0.20.37
```

### ❌ Restore Fails

```cmd
# 1. Check backup files exist
dir F:\ROA\autobackup\*.BKP

# 2. Check Oracle service
sc query OracleServiceROA

# 3. Check PFILE exists
dir C:\Users\oracle\admin\ROA\pfile\initROA.ora

# 4. View restore log
type D:\oracle\logs\restore_from_zero.log
```

### ❌ VM Won't Start

```bash
# Check VM status
ssh root@10.0.20.202 "qm status 109"

# Check VM config
ssh root@10.0.20.202 "qm config 109 | grep -E 'memory|cores|bootdisk'"

# Force unlock if locked
ssh root@10.0.20.202 "qm unlock 109"

# Start with console
ssh root@10.0.20.202 "qm start 109 && qm terminal 109"
```

## 📈 Monitoring & Metrics

### Key Metrics

| Metric | Target | Alert Threshold |
|--------|--------|-----------------|
| FULL Backup Age | < 24h | > 25h |
| CUMULATIVE Age | < 6h | > 7h |
| Backup Size | ~7 GB/day | > 10 GB |
| Restore Time | < 15 min | > 30 min |
| Disk Usage | < 80% | > 80% |

### Check Logs

```bash
# Backup logs (on PRIMARY)
Get-Content D:\rman_backup\logs\backup_*.log -Tail 50

# Transfer logs (on PRIMARY) - UNIFIED script
Get-Content D:\rman_backup\logs\transfer_*.log -Tail 50

# Monitoring logs (on Proxmox)
tail -50 /var/log/oracle-dr/*.log

# Restore logs (on VM 109)
type D:\oracle\logs\restore_from_zero.log
```

## 🔐 Security & Access

### SSH Keys Setup

```
PRIMARY (10.0.20.36) ──────► PROXMOX (10.0.20.202)
                             SSH Key, Port 22

LINUX WORKSTATION ─────────► PROXMOX (10.0.20.202)
                             SSH Key, Port 22

LINUX WORKSTATION ─────────► VM 109 (10.0.20.37)
                             SSH Key, Port 22122
```

### Required Credentials
- **PRIMARY**: Administrator (for scheduled tasks)
- **PROXMOX**: root (for scripts and VM control)
- **VM 109**: romfast (user), SYSTEM (Oracle service)

## 📅 Maintenance Schedule

| Day | Time | Action | Duration | Impact |
|-----|------|--------|----------|--------|
| Daily | 02:30 | FULL Backup | 30 min | None |
| Daily | 09:00 | Monitor Backups | 1 min | None |
| Daily | 13:00 | CUMULATIVE Backup | 5 min | None |
| Daily | 18:00 | CUMULATIVE Backup | 5 min | None |
| Saturday | 06:00 | DR Test | 30 min | None |

## 🚨 Disaster Recovery Procedure

### When PRIMARY is DOWN:

1. **Confirm PRIMARY is unreachable**
   ```bash
   ping 10.0.20.36
   # Should fail
   ```
2. **Start DR VM**
   ```bash
   ssh root@10.0.20.202 "qm start 109"
   ```
3. **Wait for boot (3 minutes)**
4. **Connect to DR VM**
   ```bash
   ssh -p 22122 romfast@10.0.20.37
   ```
5. **Run restore**
   ```cmd
   D:\oracle\scripts\rman_restore_from_zero.cmd
   ```
6. **Verify database**
   ```sql
   sqlplus / as sysdba
   SELECT name, open_mode FROM v$database;
   -- Should show: ROA, READ WRITE
   ```
7. **Update application connections**
   - Change from: 10.0.20.36:1521/ROA
   - Change to: 10.0.20.37:1521/ROA
8. **Monitor DR system**
   - Database is now production
   - Do NOT run cleanup!
   - Keep VM running

## 📝 Quick Reference Card

```
╔══════════════════════════════════════════════════════════════╗
║                      DR QUICK REFERENCE                      ║
╠══════════════════════════════════════════════════════════════╣
║ PRIMARY DOWN?                                                ║
║   ssh root@10.0.20.202                                       ║
║   qm start 109                                               ║
║   # Wait 3 min                                               ║
║   ssh -p 22122 romfast@10.0.20.37                            ║
║   D:\oracle\scripts\rman_restore_from_zero.cmd               ║
╠══════════════════════════════════════════════════════════════╣
║ TEST DR?                                                     ║
║   ssh root@10.0.20.202 "/opt/scripts/weekly-dr-test-proxmox.sh" ║
╠══════════════════════════════════════════════════════════════╣
║ CHECK BACKUPS?                                               ║
║   ssh root@10.0.20.202 "/opt/scripts/oracle-backup-monitor-proxmox.sh" ║
╠══════════════════════════════════════════════════════════════╣
║ SUPPORT:                                                     ║
║   Logs: /var/log/oracle-dr/                                  ║
║   Docs: proxmox/vm109-windows-dr/docs/                       ║
╚══════════════════════════════════════════════════════════════╝
```

---

## 📂 Directory Structure

```
vm109-windows-dr/
├── README.md                             # This file
├── docs/
│   ├── PLAN_TESTARE_MONITORIZARE.md      # DR testing and monitoring plan
│   ├── PROXMOX_NOTIFICATIONS_README.md   # Proxmox notification setup
│   ├── FAILBACK_PROCEDURE.md             # Failback DR → PRIMARY (reverse procedure)
│   └── archive/                          # Earlier plans and status reports
│       ├── DR_UPGRADE_TO_CUMULATIVE_PLAN.md
│       ├── DR_VM_MIGRATION_GUIDE.md
│       ├── DR_WINDOWS_VM_IMPLEMENTATION_PLAN.md
│       └── DR_WINDOWS_VM_STATUS_2025-10-09.md
└── scripts/
    ├── oracle-backup-monitor-proxmox.sh  # Daily monitoring (Proxmox)
    ├── weekly-dr-test-proxmox.sh         # Weekly DR test (Proxmox)
    ├── rman_backup.bat                   # RMAN full backup (Windows)
    ├── rman_backup_incremental.bat       # RMAN incremental (Windows)
    ├── transfer_backups.ps1              # Backup transfer (Windows)
    ├── rman_restore_from_zero.ps1        # Restore PRIMARY → DR (disaster activation)
    ├── rman_restore_to_primary.ps1       # Restore DR → PRIMARY (failback)
    ├── cleanup_database.ps1              # Post-test cleanup (Windows DR)
    └── *.ps1                             # Other configuration scripts
```

---

**Last Updated:** 2026-01-27
**Version:** 2.2 - Unified transfer script (transfer_backups.ps1)
**Status:** ✅ Production Ready

## 📋 Changelog

### v2.2 (Oct 31, 2025)
- ✨ **Unified transfer script**: replaced `transfer_to_dr.ps1` and `transfer_incremental.ps1` with a single `transfer_backups.ps1`
- 🎯 **Smart duplicate detection**: automatically skips files that already exist on DR
- ⚡ **Flexible scheduling**: can run after any backup type or manually
- 🔧 **Simplified maintenance**: one script to maintain instead of two

### v2.1 (Oct 11, 2025)
- Added restore test debugging guide
- Implemented new backup naming convention