docs: Add Windows OCR dependencies and fix IIS API error handling

- Add OCR installation instructions for Windows (Poppler, Tesseract, PaddleOCR)
- Add troubleshooting section for common OCR errors on Windows
- Fix web.config.data-entry to use existingResponse="Auto" instead of "Replace"
  This allows FastAPI JSON error responses to pass through IIS unchanged
- Update system requirements to recommend 16GB RAM for OCR workloads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2025-12-18 19:43:33 +02:00
parent 0851d01917
commit 642ae3a96c
3 changed files with 432 additions and 42 deletions

View File

@@ -125,6 +125,61 @@ dnf install -y \
**Note:** PaddleOCR (engine principal) se instaleaza automat cu pip. Tesseract este folosit ca fallback.
### OCR System Dependencies (Windows)
Pe Windows Server trebuie instalate manual urmatoarele componente:
#### 1. Poppler (pentru conversie PDF → imagini)
```powershell
# Descarca Poppler pentru Windows
# https://github.com/osborn/poppler-windows/releases
# sau https://github.com/bblanchon/pdfium-binaries
# Extrage in C:\Program Files\poppler\
# Adauga la PATH: C:\Program Files\poppler\Library\bin
```
#### 2. Tesseract OCR (engine OCR backup)
```powershell
# Descarca installer de la:
# https://github.com/UB-Mannheim/tesseract/wiki
# Instaleaza cu limbile: English + Romanian
# Default path: C:\Program Files\Tesseract-OCR\
# Adauga la PATH
```
#### 3. Python OCR Dependencies (in venv)
```powershell
cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
# Instaleaza dependentele OCR
pip install paddlepaddle>=2.5.0
pip install paddleocr>=2.7.0
pip install opencv-python>=4.8.0
pip install pytesseract>=0.3.10
pip install pdf2image>=1.16.0
# Sau din requirements.txt
pip install -r requirements.txt
```
#### 4. Restart serviciu
```powershell
nssm restart ROA2WEB-DataEntry
```
**Note importante Windows:**
- Prima rulare PaddleOCR descarca modele (~200MB) - poate dura cateva minute
- PaddleOCR necesita ~2GB RAM disponibil
- Verifica PATH-ul pentru Poppler si Tesseract dupa instalare
- Restart serviciul backend dupa orice modificare PATH
### OCR API Endpoints
| Method | Endpoint | Description |
@@ -270,6 +325,49 @@ Full API documentation available at http://localhost:8003/docs when backend is r
2. Install system dependencies (tesseract, poppler)
3. Verify PaddleOCR installed: `python -c "from paddleocr import PaddleOCR"`
### OCR Windows - "poppler not in PATH"
```powershell
# Eroare: "Unable to get page count. Is poppler installed and in PATH?"
# Solutie 1: Adauga Poppler la PATH
# System Properties → Environment Variables → System variables → Path → New
# Adauga: C:\Program Files\poppler\Library\bin
# Solutie 2: Restart serviciul dupa modificarea PATH
nssm restart ROA2WEB-DataEntry
# Verificare:
pdfinfo --version
```
### OCR Windows - "tesseract not found"
```powershell
# Eroare: "tesseract is not installed or it's not in your PATH"
# Solutie: Adauga Tesseract la PATH
# C:\Program Files\Tesseract-OCR\
# Verificare:
tesseract --version
tesseract --list-langs # Trebuie sa arate 'ron' si 'eng'
```
### OCR Windows - PaddleOCR import error
```powershell
# Eroare: "No module named 'paddleocr'"
cd C:\inetpub\wwwroot\roa2web\data-entry-backend
.\venv\Scripts\activate
pip install paddlepaddle>=2.5.0
pip install paddleocr>=2.7.0
# Restart serviciu
nssm restart ROA2WEB-DataEntry
```
### Low OCR accuracy
- Ensure good lighting when taking receipt photos