
Chapter 1

Architecture & Pipeline Design

MinerU has three different processing pipelines. Pick the wrong one and your output is garbage — even if everything else is perfect.

The Pipeline Decision Tree

MinerU's architecture isn't one-size-fits-all. The library internally routes documents through different pipelines based on content type. Understanding this routing is the difference between clean Markdown and unusable output.

The three pipelines are:

  • Text-based pipeline — For native digital PDFs. Fast, CPU-friendly. Uses PyMuPDF for text extraction + layout preservation.
  • Scanned document pipeline — For image-based PDFs. Requires OCR (PaddleOCR or Tesseract) + layout detection model. GPU recommended.
  • Mixed pipeline — For PDFs with both text and images. Most real-world documents fall here. Most complex to configure correctly.
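The three categories above can be sketched as a small routing function. This is only an illustration of the decision, not MinerU's actual routing code: the `PageStats` type, the `choose_pipeline` name, and the 50-character threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page summary (not a MinerU type)."""
    text_chars: int   # characters found in the embedded text layer
    has_images: bool  # True if the page carries raster images


def choose_pipeline(pages: List[PageStats], min_chars: int = 50) -> str:
    """Return 'text', 'scanned', or 'mixed' for a document.

    min_chars is an assumed threshold: pages with fewer extractable
    characters are treated as having no usable text layer.
    """
    if not pages:
        raise ValueError("document has no pages")
    has_text = [p.text_chars >= min_chars for p in pages]
    if all(has_text) and not any(p.has_images for p in pages):
        return "text"      # native digital PDF, no OCR needed
    if not any(has_text):
        return "scanned"   # image-only pages, OCR required
    return "mixed"         # both text and image content


if __name__ == "__main__":
    doc = [PageStats(1200, False), PageStats(0, True)]
    print(choose_pipeline(doc))
```

In practice the per-page statistics would come from whatever PDF library you already use; the point is that the routing decision depends on whether a usable text layer exists, not on the file extension.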

CPU vs GPU: When It Actually Matters

The official README says GPU is "recommended" but doesn't quantify the difference. Here's what we measured:

Pipeline (100 pages)    CPU (32-core)    GPU (T4)    GPU (A10)
Text-based              12 s             11 s        10 s
Scanned                 340 s            45 s        22 s
Mixed                   180 s            38 s        19 s

The takeaway: for text-based PDFs, CPU is fine; a GPU saves only 1 to 2 seconds per 100 pages. For scanned PDFs, GPU is a 7.6x (T4) to 15.5x (A10) speedup; for mixed PDFs it is 4.7x to 9.5x. But GPU type matters less than GPU memory: model loading eats VRAM before throughput matters.
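The speedup figures follow directly from the measured timings. A short script to reproduce the arithmetic (the dictionary layout and function name are just for this example):

```python
# Seconds per 100 pages, copied from the measurements above.
timings = {
    "text-based": {"cpu": 12, "t4": 11, "a10": 10},
    "scanned": {"cpu": 340, "t4": 45, "a10": 22},
    "mixed": {"cpu": 180, "t4": 38, "a10": 19},
}


def speedup(pipeline: str, gpu: str) -> float:
    """CPU time divided by GPU time for one pipeline."""
    row = timings[pipeline]
    return row["cpu"] / row[gpu]


for name in timings:
    print(f"{name}: T4 {speedup(name, 't4'):.1f}x, A10 {speedup(name, 'a10'):.1f}x")
# text-based: T4 1.1x, A10 1.2x
# scanned: T4 7.6x, A10 15.5x
# mixed: T4 4.7x, A10 9.5x
```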

Backend Selection: vLLM vs sglang vs Native

MinerU supports multiple inference backends for the VLM (Vision Language Model) component. The choice affects both speed and output quality:

  • Native transformers — Easiest setup, highest memory usage, slowest inference. Good for testing.
  • vLLM — Best throughput for batch processing. PagedAttention for efficient KV cache. Our recommendation for production.
  • sglang — Competitive with vLLM, better for structured outputs. Smaller community but active development.
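The recommendations above boil down to a simple rule of thumb, sketched here as a helper. The `Backend` enum and `pick_backend` function are hypothetical names for illustration; the enum values are plain labels, not MinerU configuration strings.

```python
from enum import Enum


class Backend(Enum):
    # Labels only; check your MinerU version for the real option names.
    TRANSFORMERS = "native transformers"
    VLLM = "vLLM"
    SGLANG = "sglang"


def pick_backend(production: bool, structured_output: bool = False) -> Backend:
    """Map a workload to a backend, following the guidance above."""
    if not production:
        return Backend.TRANSFORMERS  # easiest setup, fine for testing
    if structured_output:
        return Backend.SGLANG        # better fit for structured outputs
    return Backend.VLLM              # best batch throughput


if __name__ == "__main__":
    print(pick_backend(production=True).value)
```

Treat this as a starting point: memory headroom and the models you need to serve may push the choice the other way.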
