OCR API - Setup

This guide covers environment setup, dependencies, Docker configuration, and running the OCR processing service. For details on how the OCR pipeline processes documents, see the Processing Pipelines page.

Prerequisites

| Tool | Version | Purpose |
| --- | --- | --- |
| Python | >= 3.11 | Runtime |
| uv | Latest | Package manager |
| Docker | 20+ | Containerized deployment |
| ODBC Driver 18 | Latest | SQL Server connectivity |

Dependencies

The OCR worker's full dependency list lives in pyproject.toml. Key libraries:

| Library | Role |
| --- | --- |
| PaddleOCR 3.3 | Primary OCR engine -- fast processing for English text, numbers, and tables |
| EasyOCR 1.7 | Secondary OCR engine -- accurate Arabic text with correct RTL reading order |
| PyMuPDF | PDF parsing: text extraction, embedded image extraction, page-to-image rendering |
| OpenCV | Image decoding and color space conversion before OCR |
| aio-pika | Async RabbitMQ client for consuming messages from ocr_queue |
| azure-storage-blob | Downloading uploaded files from Azure Blob Storage |
| aioodbc + pyodbc | Async SQL Server access for updating processing status |

For the full database schema and table definitions, see the Database Schema page.

Environment Variables

Configuration is managed via Pydantic Settings in app/core/config.py, loading from a .env file. Copy the example and fill in your credentials:

cp .env.example .env

The .env.example file documents every variable. The SQL_CONNECTION_STRING is computed automatically from the SQL variables:

mssql+aioodbc://{SQL_USER}:{SQL_PASS}@{SQL_SERVER}/{SQL_DB_NAME}?driver={SQL_DRIVER}&TrustServerCertificate=yes
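A rough sketch of how such a string can be assembled in Python (illustrative only -- the actual logic lives in app/core/config.py; the URL-encoding of the credentials and driver name is a safeguard this sketch adds, since characters like '@' or spaces would otherwise break the URL):

```python
from urllib.parse import quote_plus

def build_connection_string(user: str, password: str, server: str, db_name: str,
                            driver: str = "ODBC Driver 18 for SQL Server") -> str:
    """Assemble an aioodbc connection URL following the template above."""
    return (
        f"mssql+aioodbc://{quote_plus(user)}:{quote_plus(password)}"
        f"@{server}/{db_name}"
        f"?driver={quote_plus(driver)}&TrustServerCertificate=yes"
    )

# Example with dummy credentials; note the encoded '@' and space in the password.
print(build_connection_string("sa", "p@ss word", "db.example.net", "ocr"))
```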

Example .env File

.env
# --- OCR Config ---
OUTPUT_DIR="documents"
GPU=false

# --- Message Broker ---
MESSAGE_BROKER_URL="amqp://guest:guest@localhost:5672/"
OCR_QUEUE_NAME="ocr_queue"

# --- SQL Server ---
SQL_SERVER="your-server.database.windows.net"
SQL_DB_NAME="your-database"
SQL_USER="your-username"
SQL_PASS="your-password"

# --- Azure Blob Storage ---
BLOB_CONNECTION_STR="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=..."
BLOB_STORAGE_CONTAINER_NAME="your-container"

Running Locally

1. Install Dependencies

cd ocr
uv sync

2. Ensure RabbitMQ is Running

If using Docker Compose from the project root:

docker compose up rabbitmq -d

Or install RabbitMQ locally and start it on the default port (5672).

3. Start the Service

uv run uvicorn app.main:app --host 0.0.0.0 --port 8001

First Startup is Slow

On first launch, PaddleOCR and EasyOCR will download their model weights (several hundred MB). Subsequent starts are much faster as models are cached locally.

4. Verify

The health endpoint returns 204 No Content when the service is ready:

curl -f http://localhost:8001/

Docker

Dockerfile Overview

The OCR service uses a multi-stage build to minimize the final image size:

graph LR
    A[Stage 1: Builder] -->|Copy venv| B[Stage 2: Runtime]
    A -->|"python:3.11-slim + build tools"| A
    B -->|"python:3.11-slim + ODBC + libgl"| B

Stage 1 (Builder):

  • Base: python:3.11-slim-bookworm
  • Installs build-essential and unixodbc-dev for compiling native extensions
  • Copies uv from the official image (ghcr.io/astral-sh/uv:0.4.0)
  • Creates a virtual environment at /opt/venv and installs all dependencies

Stage 2 (Runtime):

  • Base: python:3.11-slim-bookworm
  • Installs Microsoft ODBC Driver 18, libgl1, and libglib2.0-0 (required by OpenCV)
  • Creates a non-root user (user14) for security
  • Copies the venv from the builder stage and the application code
  • Mounts /ocr/documents as a volume for OCR output

Key Docker Configuration

| Setting | Value |
| --- | --- |
| Exposed port | 8000 |
| Healthcheck | curl -f http://localhost:8000/ every 30s |
| Start period | 300s (5 minutes -- allows model download on first run) |
| User | user14 (non-root) |
| Volume | /ocr/documents |
| Entrypoint | uvicorn app.main:app --host 0.0.0.0 --port 8000 |

Environment Variables in Docker

environment:
  GPU: "false"
  OUTPUT_DIR: "/ocr/documents"

All other variables (SQL_*, BLOB_*, MESSAGE_BROKER_URL) should be passed via environment variables or a .env file. For the full Docker Compose orchestration, see the Deployment page.

Build and Run

# Build
docker build -t nassaq-ocr:latest ./ocr

# Run
docker run -d \
  --name nassaq-ocr \
  --env-file ocr/.env \
  -p 8001:8000 \
  -v ocr-documents:/ocr/documents \
  nassaq-ocr:latest

GPU Configuration

Both PaddleOCR and EasyOCR support GPU acceleration via CUDA. To enable it:

  1. Set GPU=true in your .env file
  2. Replace paddlepaddle with paddlepaddle-gpu in pyproject.toml
  3. If using Docker, use an NVIDIA CUDA base image and install nvidia-container-toolkit

CPU Mode is Default

The default configuration runs entirely on CPU. This is adequate for most document processing workloads. GPU acceleration primarily benefits batch processing of large image collections.
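For illustration, here is a minimal sketch of how the GPU flag might flow from the environment into the EasyOCR reader. The gpu_enabled and make_reader helpers are hypothetical -- the real service reads configuration through Pydantic Settings -- and the easyocr import is deferred so the flag logic works even without the package installed:

```python
import os

def gpu_enabled(default: bool = False) -> bool:
    """Interpret the GPU environment variable as a boolean (hypothetical helper)."""
    raw = os.getenv("GPU")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")

def make_reader(gpu: bool):
    """Build the Arabic/English EasyOCR reader with the chosen GPU setting."""
    import easyocr  # deferred: only needed when a reader is actually created
    return easyocr.Reader(["ar", "en"], gpu=gpu)

os.environ["GPU"] = "false"   # matches the default .env above
print(gpu_enabled())          # False -> reader would run on CPU
```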

Application Lifecycle

The FastAPI application uses a lifespan context manager that handles startup and shutdown:

sequenceDiagram
    participant App as FastAPI App
    participant P as PaddleOCR
    participant E as EasyOCR
    participant B as RabbitMQ
    participant S as Blob Storage

    Note over App: Startup
    App->>P: Load PaddleOCR engine (lang="ar")
    App->>E: Load EasyOCR reader (["ar", "en"])
    App->>B: Connect to RabbitMQ
    App->>S: Initialize BlobDownloader
    App->>B: Start consuming from ocr_queue
    Note over App: Ready to process messages

    Note over App: Shutdown
    App->>B: Close RabbitMQ connection
    App->>S: Close Blob client

Startup Sequence

  1. Load PaddleOCR -- Initializes the PaddleOCR engine with use_angle_cls=False and lang="ar"
  2. Load EasyOCR -- Creates an EasyOCR Reader for Arabic and English with GPU setting from config
  3. Connect RabbitMQ -- Establishes a robust connection to the message broker
  4. Initialize Blob Storage -- Creates the Azure Blob downloader client
  5. Start Consumer -- Begins consuming messages from the configured queue (ocr_queue by default)

Shutdown Sequence

  1. Close the RabbitMQ connection
  2. Close the Blob Storage client

API Endpoints

The OCR service exposes two endpoints:

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Health check (returns 204 No Content) |
| POST | /api/v1/docs | Direct file upload for OCR processing (batch mode) |

Primary Processing Path

In production, documents are primarily processed via the RabbitMQ consumer, not the REST endpoint. The backend server uploads files to Azure Blob Storage and publishes a message to ocr_queue. The OCR worker consumes the message, downloads the file, and processes it asynchronously.

The POST /api/v1/docs endpoint exists for direct testing and standalone use.