OCR API - Setup¶
This guide covers environment setup, dependencies, Docker configuration, and running the OCR processing service. For details on how the OCR pipeline processes documents, see the Processing Pipelines page.
Prerequisites¶
| Tool | Version | Purpose |
|---|---|---|
| Python | >= 3.11 | Runtime |
| uv | Latest | Package manager |
| Docker | 20+ | Containerized deployment |
| ODBC Driver 18 | Latest | SQL Server connectivity |
Dependencies¶
The OCR worker's full dependency list lives in pyproject.toml. Key libraries:
| Library | Role |
|---|---|
| PaddleOCR 3.3 | Primary OCR engine -- fast processing for English text, numbers, and tables |
| EasyOCR 1.7 | Secondary OCR engine -- accurate Arabic text with correct RTL reading order |
| PyMuPDF | PDF parsing: text extraction, embedded image extraction, page-to-image rendering |
| OpenCV | Image decoding and color space conversion before OCR |
| aio-pika | Async RabbitMQ client for consuming messages from ocr_queue |
| azure-storage-blob | Downloading uploaded files from Azure Blob Storage |
| aioodbc + pyodbc | Async SQL Server access for updating processing status |
For the full database schema and table definitions, see the Database Schema page.
Environment Variables¶
Configuration is managed via Pydantic Settings in app/core/config.py, loading from a .env file. Copy the example and fill in your credentials:
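A minimal way to bootstrap the configuration (the `ocr/` directory prefix is an assumption based on the Docker build context used later in this page):

```shell
# Copy the documented template, then edit it with real credentials
cp ocr/.env.example ocr/.env
```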
The .env.example file documents every variable. The SQL_CONNECTION_STRING is computed automatically from the SQL variables:
```
mssql+aioodbc://{SQL_USER}:{SQL_PASS}@{SQL_SERVER}/{SQL_DB_NAME}?driver={SQL_DRIVER}&TrustServerCertificate=yes
```
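As a sketch, the interpolation can be reproduced in the shell with placeholder values. The `SQL_DRIVER` value shown is an assumption; the actual default lives in `app/core/config.py`:

```shell
# Placeholder values -- real values come from your .env file
SQL_USER="app_user"
SQL_PASS="s3cret"
SQL_SERVER="your-server.database.windows.net"
SQL_DB_NAME="your-database"
SQL_DRIVER="ODBC+Driver+18+for+SQL+Server"   # assumed default, URL-encoded

# Same interpolation the settings class performs
SQL_CONNECTION_STRING="mssql+aioodbc://${SQL_USER}:${SQL_PASS}@${SQL_SERVER}/${SQL_DB_NAME}?driver=${SQL_DRIVER}&TrustServerCertificate=yes"
echo "$SQL_CONNECTION_STRING"
```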
Example .env File¶
```ini
# --- OCR Config ---
OUTPUT_DIR="documents"
GPU=false

# --- Message Broker ---
MESSAGE_BROKER_URL="amqp://guest:guest@localhost:5672/"
OCR_QUEUE_NAME="ocr_queue"

# --- SQL Server ---
SQL_SERVER="your-server.database.windows.net"
SQL_DB_NAME="your-database"
SQL_USER="your-username"
SQL_PASS="your-password"

# --- Azure Blob Storage ---
BLOB_CONNECTION_STR="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=..."
BLOB_STORAGE_CONTAINER_NAME="your-container"
```
Running Locally¶
1. Install Dependencies¶
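Assuming a standard `uv` project layout in the `ocr` directory:

```shell
cd ocr
uv sync   # creates a virtual environment and installs everything declared in pyproject.toml
```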
2. Ensure RabbitMQ is Running¶
If using Docker Compose from the project root:
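For example (the `rabbitmq` service name is an assumption about the compose file):

```shell
docker compose up -d rabbitmq
```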
Or install RabbitMQ locally and start it on the default port (5672).
3. Start the Service¶
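One way to start the worker locally is the same uvicorn command used as the Docker entrypoint:

```shell
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000
```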
First Startup is Slow
On first launch, PaddleOCR and EasyOCR will download their model weights (several hundred MB). Subsequent starts are much faster as models are cached locally.
4. Verify¶
The health endpoint returns 204 No Content when the service is ready:
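```shell
curl -i http://localhost:8000/
# A ready service answers with a 204 No Content status line
```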
Docker¶
Dockerfile Overview¶
The OCR service uses a multi-stage build to minimize the final image size:
```mermaid
graph LR
    A[Stage 1: Builder] -->|Copy venv| B[Stage 2: Runtime]
    A -->|"python:3.11-slim + build tools"| A
    B -->|"python:3.11-slim + ODBC + libgl"| B
```
Stage 1 (Builder):

- Base: `python:3.11-slim-bookworm`
- Installs `build-essential` and `unixodbc-dev` for compiling native extensions
- Copies `uv` from the official image (`ghcr.io/astral-sh/uv:0.4.0`)
- Creates a virtual environment at `/opt/venv` and installs all dependencies
Stage 2 (Runtime):

- Base: `python:3.11-slim-bookworm`
- Installs Microsoft ODBC Driver 18, `libgl1`, and `libglib2.0-0` (required by OpenCV)
- Creates a non-root user (`user14`) for security
- Copies the venv from the builder stage and the application code
- Mounts `/ocr/documents` as a volume for OCR output
Key Docker Configuration¶
| Setting | Value |
|---|---|
| Exposed port | 8000 |
| Healthcheck | curl -f http://localhost:8000/ every 30s |
| Start period | 300s (5 minutes -- allows model download on first run) |
| User | user14 (non-root) |
| Volume | /ocr/documents |
| Entrypoint | uvicorn app.main:app --host 0.0.0.0 --port 8000 |
Environment Variables in Docker¶
All runtime variables (`SQL_*`, `BLOB_*`, `MESSAGE_BROKER_URL`, and the rest) should be passed to the container via environment variables or a `.env` file. For the full Docker Compose orchestration, see the Deployment page.
Build and Run¶
```shell
# Build
docker build -t nassaq-ocr:latest ./ocr

# Run
docker run -d \
  --name nassaq-ocr \
  --env-file ocr/.env \
  -p 8001:8000 \
  -v ocr-documents:/ocr/documents \
  nassaq-ocr:latest
```
GPU Configuration¶
Both PaddleOCR and EasyOCR support GPU acceleration via CUDA. To enable it:

- Set `GPU=true` in your `.env` file
- Replace `paddlepaddle` with `paddlepaddle-gpu` in `pyproject.toml`
- If using Docker, use an NVIDIA CUDA base image and install `nvidia-container-toolkit` on the host
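A hedged sketch of running a GPU-enabled container, assuming the image was rebuilt on a CUDA base and the host has the NVIDIA container toolkit installed:

```shell
docker run -d \
  --name nassaq-ocr-gpu \
  --gpus all \
  --env-file ocr/.env \
  -p 8001:8000 \
  -v ocr-documents:/ocr/documents \
  nassaq-ocr:latest
```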
CPU Mode is Default
The default configuration runs entirely on CPU. This is adequate for most document processing workloads. GPU acceleration primarily benefits batch processing of large image collections.
Application Lifecycle¶
The FastAPI application uses a lifespan context manager that handles startup and shutdown:
```mermaid
sequenceDiagram
    participant App as FastAPI App
    participant P as PaddleOCR
    participant E as EasyOCR
    participant B as RabbitMQ
    participant S as Blob Storage
    Note over App: Startup
    App->>P: Load PaddleOCR engine (lang="ar")
    App->>E: Load EasyOCR reader (["ar", "en"])
    App->>B: Connect to RabbitMQ
    App->>S: Initialize BlobDownloader
    App->>B: Start consuming from ocr_queue
    Note over App: Ready to process messages
    Note over App: Shutdown
    App->>B: Close RabbitMQ connection
    App->>S: Close Blob client
```
Startup Sequence¶
- Load PaddleOCR -- Initializes the PaddleOCR engine with `use_angle_cls=False` and `lang="ar"`
- Load EasyOCR -- Creates an EasyOCR `Reader` for Arabic and English with the GPU setting from config
- Connect RabbitMQ -- Establishes a robust connection to the message broker
- Initialize Blob Storage -- Creates the Azure Blob downloader client
- Start Consumer -- Begins consuming messages from the configured queue (`ocr_queue` by default)
Shutdown Sequence¶
- Close the RabbitMQ connection
- Close the Blob Storage client
API Endpoints¶
The OCR service exposes two endpoints:
| Method | Path | Description |
|---|---|---|
| GET | `/` | Health check (returns 204 No Content) |
| POST | `/api/v1/docs` | Direct file upload for OCR processing (batch mode) |
Primary Processing Path
In production, documents are primarily processed via the RabbitMQ consumer, not the REST endpoint. The backend server uploads files to Azure Blob Storage and publishes a message to ocr_queue. The OCR worker consumes the message, downloads the file, and processes it asynchronously.
The POST /api/v1/docs endpoint exists for direct testing and standalone use.
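For quick testing, a document can be posted directly. The multipart field name `file` and the sample filename are assumptions; check the endpoint's signature in the application code:

```shell
curl -X POST http://localhost:8000/api/v1/docs \
  -F "file=@sample.pdf"
```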