
Chapter 4

Performance Tuning

Your GPU sits at 40% utilization while a queue of 5,000 PDFs piles up. The defaults aren't optimized for your hardware. Here's how to fix it.

The Three Knobs That Actually Matter

MinerU performance tuning boils down to three parameters. Everything else is marginal:

  1. Batch size: how many pages MinerU processes in one forward pass. Larger batches raise GPU throughput but use more VRAM. The sweet spot is usually 4-8 for scanned PDFs on a T4, 8-16 on an A10.
  2. Concurrent workers: how many Ray actors process PDFs in parallel. Each worker loads its own model copy. Too many and you run out of memory (OOM). Too few and the GPU sits idle.
  3. VLM offload: whether the VLM runs on the same GPU as OCR or on a separate one. Splitting them can double throughput for mixed-pipeline documents.
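
To keep the three knobs together while experimenting, they can be held in one small structure. The sketch below is illustrative only: `TuningKnobs`, `starting_knobs`, and the field names are hypothetical and not part of MinerU's configuration. The batch-size ranges are the ones quoted above for scanned PDFs.

```python
from dataclasses import dataclass

# Starting batch-size ranges quoted in the text for scanned PDFs.
# The dictionary layout is illustrative, not a MinerU API.
SCANNED_PDF_BATCH_RANGE = {"T4": (4, 8), "A10": (8, 16)}


@dataclass(frozen=True)
class TuningKnobs:
    """Hypothetical container for the three parameters described above."""
    batch_size: int          # pages per forward pass
    concurrent_workers: int  # parallel workers, each with its own model copy
    vlm_offload: bool        # True if the VLM runs on a separate GPU

    def __post_init__(self):
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if self.concurrent_workers < 1:
            raise ValueError("concurrent_workers must be at least 1")


def starting_knobs(gpu: str, vlm_offload: bool = False) -> TuningKnobs:
    """Pick the low end of the quoted range as a conservative first guess."""
    low, _high = SCANNED_PDF_BATCH_RANGE[gpu]
    return TuningKnobs(batch_size=low, concurrent_workers=1, vlm_offload=vlm_offload)
```

Starting from the low end of the range with one worker leaves headroom. Raise one knob at a time and watch VRAM before touching the next.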

GPU Memory Budget

Before tuning, understand where your VRAM goes. Here's the memory budget for a typical T4 (16 GB):

Component                     VRAM Usage   Notes
Layout detection model        ~1.2 GB      DocTR or PaddleOCR layout
OCR recognition model         ~800 MB      PaddleOCR rec
VLM (optional)                ~4-8 GB      Depends on model size
CUDA context + overhead       ~500 MB      CUDA runtime + cuDNN
Batch processing workspace    ~2-4 GB      Scales with batch size
Remaining for 2nd worker      ~2-7 GB      Only if VLM is offloaded

On a T4 with VLM enabled, you get one worker. Without VLM (text-only pipeline), you can run 2-3 workers on the same GPU.
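
The worker counts above follow from simple arithmetic on the budget figures. Here is a rough sketch, assuming every worker carries its own model copies, CUDA context, and batch workspace, and that the full 16 GB is available (in practice slightly less is usable). The function names are hypothetical.

```python
# Approximate figures from the T4 budget above, in GB.
LAYOUT_GB = 1.2
OCR_GB = 0.8
CUDA_OVERHEAD_GB = 0.5


def per_worker_gb(workspace_gb: float, vlm_gb: float = 0.0) -> float:
    """VRAM one worker needs: models + CUDA overhead + batch workspace."""
    return LAYOUT_GB + OCR_GB + CUDA_OVERHEAD_GB + workspace_gb + vlm_gb


def max_workers(total_vram_gb: float, workspace_gb: float, vlm_gb: float = 0.0) -> int:
    """How many whole workers fit into the given VRAM (assumption: no sharing)."""
    return int(total_vram_gb // per_worker_gb(workspace_gb, vlm_gb))


# Text-only pipeline on a 16 GB T4: 2-3 workers depending on batch workspace.
print(max_workers(16, workspace_gb=2.0))              # 3
print(max_workers(16, workspace_gb=4.0))              # 2
# With even a small VLM on the same GPU, only one worker fits.
print(max_workers(16, workspace_gb=2.0, vlm_gb=4.0))  # 1
```

Treat the result as an upper bound. Measure real usage before committing to a worker count, since fragmentation and peak allocations can push a worker past its estimate.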

Quick Wins Before Deep Tuning

  • Disable OCR for text-based PDFs — MinerU sometimes runs OCR on pages that don't need it. Set enable_ocr: auto instead of true.
  • Pre-sort PDFs by page count — Batch similar-sized PDFs together. One 500-page PDF in a batch of 10-page PDFs creates a straggler that holds up the entire batch.
  • Use FP16 for VLM — Half precision cuts VLM memory by ~40% with negligible accuracy loss for document understanding tasks.
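
The pre-sorting tip can be sketched in a few lines. This is a minimal illustration, assuming page counts have already been read with whatever PDF library you use. `batch_by_page_count` and the file names are hypothetical, not part of MinerU.

```python
from typing import Dict, List


def batch_by_page_count(page_counts: Dict[str, int], batch_size: int) -> List[List[str]]:
    """Group PDFs into batches of similar length.

    page_counts maps a PDF path to its number of pages.
    Sorting by page count first keeps long documents together, so a single
    large PDF does not stall a batch of short ones.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Sort by page count, then by path so the order is deterministic.
    ordered = sorted(page_counts, key=lambda path: (page_counts[path], path))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


counts = {"a.pdf": 10, "big.pdf": 500, "b.pdf": 12, "c.pdf": 9}
print(batch_by_page_count(counts, batch_size=3))
# [['c.pdf', 'a.pdf', 'b.pdf'], ['big.pdf']]
```

Here the 500-page document lands in its own batch instead of holding up the three short ones.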