
Chapter 3

Multi-Node Batch Processing

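One GPU processes ~15 pages/second on scanned PDFs. A 10,000-document batch averaging 50 pages each is 500,000 pages, or about 9.3 hours on a single GPU. Spreading the batch across nodes cuts that roughly in proportion to GPU count: 16 GPUs bring it down to about 35 minutes, before scheduling and I/O overhead.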
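The arithmetic is easy to rerun for other batch sizes. The sketch below uses only the 15 pages/second figure quoted above and assumes perfectly linear scaling across GPUs, which real clusters will not quite reach; the function names are illustrative.

```python
import math

PAGES_PER_SECOND_PER_GPU = 15  # throughput figure quoted above for scanned PDFs


def batch_hours(documents: int, pages_per_doc: int, gpus: int = 1) -> float:
    """Estimated wall-clock hours, assuming ideal linear scaling across GPUs."""
    total_pages = documents * pages_per_doc
    seconds = total_pages / (PAGES_PER_SECOND_PER_GPU * gpus)
    return seconds / 3600


def gpus_needed(documents: int, pages_per_doc: int, target_minutes: float) -> int:
    """Smallest GPU count that meets the target time under the same assumption."""
    total_pages = documents * pages_per_doc
    pages_per_gpu = PAGES_PER_SECOND_PER_GPU * target_minutes * 60
    return math.ceil(total_pages / pages_per_gpu)


for n in (1, 8, 16, 32):
    print(f"{n:>2} GPUs: {batch_hours(10_000, 50, n) * 60:.0f} min")
```

Treat the result as a lower bound: scheduling overhead, storage throughput, and uneven document sizes all add time on top.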

Ray Cluster Architecture for MinerU

In this architecture, Ray is the distributed backbone for MinerU: it handles task scheduling, GPU allocation, and failure recovery. A production cluster needs three components:

  • Ray Head Node: scheduler and object store coordinator. It can be lightweight; 2 vCPUs and 4 GB RAM are enough. It also runs the queue manager.
  • Ray Worker Nodes: GPU instances that run the actual MinerU pipelines. These are the expensive ones.
  • Shared Storage: an NFS share or S3 bucket for input PDFs and output Markdown. Every worker needs read/write access.
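To make the division of labor concrete, here is a minimal sketch of how a driver might fan PDF paths out to GPU workers. It is not the guide's queue manager. The `convert` callable is a hypothetical stand-in for whatever invokes MinerU on one file, and the chunk size and retry count are arbitrary assumptions. Ray is imported lazily so the helper functions can be exercised on a machine without it.

```python
from typing import Callable, List


def chunk(paths: List[str], size: int) -> List[List[str]]:
    """Split the path list into fixed-size chunks, one chunk per Ray task."""
    return [paths[i:i + size] for i in range(0, len(paths), size)]


def convert_chunk(paths: List[str], convert: Callable[[str], str]) -> List[str]:
    """Runs on a worker: convert each PDF and return the output locations."""
    return [convert(p) for p in paths]


def run_on_cluster(paths: List[str], convert: Callable[[str], str],
                   chunk_size: int = 32) -> List[str]:
    """Submit one GPU task per chunk and gather the results on the driver."""
    import ray  # lazy import: only needed when a cluster is actually used

    ray.init(address="auto")  # connect to an already-running head node
    # Reserve one GPU per task; retry a task if its worker process dies.
    remote_convert = ray.remote(num_gpus=1, max_retries=2)(convert_chunk)
    futures = [remote_convert.remote(c, convert) for c in chunk(paths, chunk_size)]
    return [out for outs in ray.get(futures) for out in outs]
```

Chunking keeps per-task overhead low while still letting Ray spread work across every GPU the workers advertise.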

Why Not Just Multiprocessing?

Python's multiprocessing hits three walls with MinerU:

  1. GIL contention: MinerU's C extensions (PyMuPDF, PaddlePaddle) release the GIL, but the Python orchestration layer doesn't. Separate processes sidestep this because each has its own GIL, but only on one machine; multiprocessing cannot spread the work across nodes.
  2. Memory duplication: each process loads its own copy of the 2GB+ models. With 8 workers, that's at least 16 GB spent on duplicate model copies.
  3. No failure isolation: one OOM kill on a corrupted PDF can take down your entire batch. Ray isolates failures to individual tasks.
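The failure-isolation point can be illustrated with the standard library alone. The sketch below is not how Ray implements it; it only shows the principle that a job running in its own process can die abruptly without taking the rest of the batch with it. The job strings are stand-ins for a real per-document command, and the file names are made up.

```python
import subprocess
import sys
from typing import Dict, List


def run_isolated(jobs: Dict[str, str], timeout: float = 60.0) -> Dict[str, List[str]]:
    """Run each job in its own interpreter; a crash marks only that job failed."""
    results: Dict[str, List[str]] = {"ok": [], "failed": []}
    for name, code in jobs.items():
        try:
            proc = subprocess.run([sys.executable, "-c", code], timeout=timeout)
            status = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            status = "failed"
        results[status].append(name)
    return results


jobs = {
    "good.pdf": "pass",
    # Simulates a worker that dies abruptly, as an OOM kill would.
    "corrupt.pdf": "import os; os._exit(137)",
    "also_good.pdf": "pass",
}
print(run_isolated(jobs))
```

The job after the crash still runs, and the crashed one ends up in a list that can later be routed to a failed/ location.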

Storage Architecture

The shared filesystem is the hardest part of multi-node MinerU. Workers need synchronized access to input files, model cache, and output directories:

s3://pdf-pipeline/
  |-- input/          # Upload PDFs here
  |-- output/         # MinerU writes Markdown here
  |-- failed/         # Corrupted/unprocessable PDFs
  |-- models/         # Shared model cache (read-only)
  |-- checkpoint/     # Ray checkpoint directory
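A small helper keeps workers consistent about where results land in this layout. The sketch below assumes one Markdown file per input PDF at the same relative path, which the layout above does not spell out; the function names are illustrative.

```python
from pathlib import PurePosixPath

BUCKET = "s3://pdf-pipeline"
INPUT_PREFIX = f"{BUCKET}/input/"


def _relative(input_uri: str) -> PurePosixPath:
    """Return the path of an input PDF relative to the input/ prefix."""
    if not input_uri.startswith(INPUT_PREFIX):
        raise ValueError(f"not under {INPUT_PREFIX}: {input_uri}")
    return PurePosixPath(input_uri[len(INPUT_PREFIX):])


def output_uri(input_uri: str) -> str:
    """Map input/<path>.pdf to output/<path>.md (assumed naming scheme)."""
    return f"{BUCKET}/output/{_relative(input_uri).with_suffix('.md')}"


def failed_uri(input_uri: str) -> str:
    """Map input/<path>.pdf to failed/<path>.pdf for unprocessable files."""
    return f"{BUCKET}/failed/{_relative(input_uri)}"


print(output_uri("s3://pdf-pipeline/input/reports/q1.pdf"))
print(failed_uri("s3://pdf-pipeline/input/reports/q1.pdf"))
```

Deriving every destination from the input URI means a retried task writes to the same location instead of creating a duplicate.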