Chapter 3
Multi-Node Batch Processing
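One GPU processes ~15 pages/second on scanned PDFs. A 10,000-document batch averaging 50 pages each is 500,000 pages, or roughly 9.3 hours on a single GPU. Spreading the batch across nodes cuts that roughly in proportion to the GPU count: about 35 minutes on 16 GPUs, assuming near-linear scaling.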
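The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `batch_hours` and the `efficiency` factor are illustrative assumptions, and the 15 pages/second figure is the one quoted above.

```python
def batch_hours(documents: int, pages_per_doc: int, pages_per_sec_per_gpu: float,
                gpus: int = 1, efficiency: float = 1.0) -> float:
    """Estimated wall-clock hours for a batch.

    efficiency is an assumed scaling factor: 1.0 means perfectly linear
    scaling across GPUs, lower values model scheduling and I/O overhead.
    """
    total_pages = documents * pages_per_doc
    pages_per_sec = pages_per_sec_per_gpu * gpus * efficiency
    return total_pages / pages_per_sec / 3600


single_gpu = batch_hours(10_000, 50, 15)            # one GPU
sixteen_gpu = batch_hours(10_000, 50, 15, gpus=16)  # 16 GPUs, linear scaling

print(f"1 GPU:   {single_gpu:.1f} hours")
print(f"16 GPUs: {sixteen_gpu * 60:.0f} minutes")
```

Lowering `efficiency` (for example to 0.8) gives a more conservative figure when storage or scheduling overhead is expected to eat into throughput.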
Ray Cluster Architecture for MinerU
Ray is MinerU's distributed backbone. It handles task scheduling, GPU allocation, and failure recovery. A production cluster needs three components:
- Ray Head Node: Scheduler and object store coordinator. Lightweight: 2 vCPU and 4 GB RAM is enough for a small cluster; give it more headroom as the worker count grows. Runs the queue manager.
- Ray Worker Nodes: GPU instances that run the actual MinerU pipelines. These account for most of the cost.
- Shared Storage: An NFS share or S3 bucket for input PDFs and output Markdown. Every worker needs read/write access.
Why Not Just Multiprocessing?
Python's multiprocessing hits three walls with MinerU:
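- GIL contention: MinerU's C extensions (PyMuPDF, PaddlePaddle) release the GIL, but the Python orchestration layer doesn't. Each worker process has its own GIL, so this only bites if you fall back to threads to avoid the memory cost below.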
- Memory duplication: Each process loads its own copy of the 2 GB+ models. With 8 workers, that's 16 GB+ of memory holding eight copies of the same weights.
- No failure isolation: One OOM kill on a corrupted PDF can take down your entire batch. Ray isolates failures to individual tasks.
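The failure-isolation point can be illustrated without Ray. The sketch below uses only the standard library and runs each conversion in its own child process, so a crash or kill ends that one job while the parent keeps going. It is not Ray's implementation: `run_isolated` and `run_batch` are hypothetical helpers, and the commands are stand-ins for whatever invokes the MinerU pipeline on a single PDF.

```python
import subprocess
import sys


def run_isolated(cmd: list[str], timeout_s: float = 600.0) -> tuple[bool, str]:
    """Run one conversion in its own process and report how it ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode == 0:
        return True, "ok"
    if proc.returncode < 0:
        # On POSIX a negative code means the child was killed by a signal,
        # e.g. -9 for SIGKILL from the kernel OOM killer.
        return False, f"killed by signal {-proc.returncode}"
    return False, f"exit code {proc.returncode}"


def run_batch(jobs: dict[str, list[str]]) -> tuple[list[str], dict[str, str]]:
    """Run every job; a failure is recorded and the loop moves on."""
    done: list[str] = []
    failed: dict[str, str] = {}
    for name, cmd in jobs.items():
        ok, detail = run_isolated(cmd)
        if ok:
            done.append(name)
        else:
            failed[name] = detail
    return done, failed


# Stand-in commands (assumption): real jobs would run the MinerU pipeline per PDF.
jobs = {
    "good.pdf": [sys.executable, "-c", "pass"],
    "corrupt.pdf": [sys.executable, "-c", "import sys; sys.exit(3)"],
}
done, failed = run_batch(jobs)
print("done:", done)
print("failed:", failed)
```

The trade-off is that every job pays process startup and model load cost, which is the memory and latency problem described above. A scheduler with long-lived workers keeps the process boundary while reusing loaded models across tasks.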
Storage Architecture
Shared storage is the hardest part of multi-node MinerU. Workers need consistent access to input files, the model cache, and output directories. A bucket layout along these lines keeps those concerns separate:
s3://pdf-pipeline/
|-- input/ # Upload PDFs here
|-- output/ # MinerU writes Markdown here
|-- failed/ # Corrupted/unprocessable PDFs
|-- models/ # Shared model cache (read-only)
|-- checkpoint/ # Ray checkpoint directory
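With a layout like this, each worker needs a deterministic way to turn an input key into its output or failed location. The sketch below shows one way to do that. The helper names are hypothetical, and it assumes one Markdown file per PDF with the sub-folder structure under input/ mirrored; a real deployment may write a directory of artifacts per document instead.

```python
from pathlib import PurePosixPath

# Prefixes taken from the layout above; keys are relative to the bucket root.
INPUT_PREFIX = PurePosixPath("input")
OUTPUT_PREFIX = PurePosixPath("output")
FAILED_PREFIX = PurePosixPath("failed")


def output_key(input_key: str) -> str:
    """Map input/<path>.pdf to output/<path>.md (assumed naming scheme)."""
    rel = PurePosixPath(input_key).relative_to(INPUT_PREFIX)
    return str(OUTPUT_PREFIX / rel.with_suffix(".md"))


def failed_key(input_key: str) -> str:
    """Map input/<path>.pdf to failed/<path>.pdf, keeping the original file name."""
    rel = PurePosixPath(input_key).relative_to(INPUT_PREFIX)
    return str(FAILED_PREFIX / rel)


print(output_key("input/reports/q3.pdf"))
print(failed_key("input/reports/q3.pdf"))
```

Because the mapping is a pure function of the input key, a worker that retries a document overwrites the same output object rather than creating a duplicate. `relative_to` raises `ValueError` for keys outside input/, which guards against writing results for files that were never part of the batch.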
Full chapter continues with:
- Complete Ray cluster YAML config for AWS/GCP
- Queue manager with priority lanes + dead-letter queue
- Autoscaling rules for spot/preemptible instances
- Failure recovery with checkpoint/resume
- Monitoring dashboard setup (Grafana + Ray Dashboard)
- Cost optimization: spot vs on-demand vs reserved