MinerU-Diffusion reformulates document OCR as inverse rendering via diffusion decoding with a block-attention architecture. It achieves a 2.1-3.2× speedup while maintaining 90%+ accuracy and robustness against semantic distortion, overcoming autoregressive efficiency bottlenecks.
Core Question
Can document OCR be reformulated as an inverse rendering problem using diffusion-based decoding to overcome efficiency bottlenecks and semantic hallucination issues inherent in autoregressive Vision-Language Models?
Core Method
Approach: The authors propose MinerU-Diffusion, a block-attention diffusion language model with hybrid factorization: autoregressive structure across blocks and parallel diffusion refinement within blocks. A two-stage curriculum learning framework trains the model on the MinerU2.5 dataset (~7.5M samples), with diversity-driven foundational learning followed by uncertainty-driven boundary refinement using task-specific consistency metrics.

Key components:
- Full-attention diffusion suffers from quadratic complexity, positional instability, and unnecessary coupling of independent document regions.
- MinerU-Diffusion introduces a block-attention architecture with hybrid factorization: autoregressive across blocks and parallel diffusion within blocks.
- Block boundaries serve as structural anchors that prevent long-range alignment drift while preserving parallel efficiency within blocks.
- Structured attention masking reduces complexity from O(L²) to O(BL'²) and enables efficient KV-caching during inference.
- The architecture conditions on native-scale visual features to keep posterior refinement grounded in visual evidence.
- The architecture uses a block-wise attention dVLM with SDAR-1.7B-Chat-b32 and a block size of 32.
- M-RoPE is removed from the MinerU2.5 architecture.
- Training proceeds in two stages: fine-tuning on LLaVA-NeXT for VQA, followed by specialized document OCR training.
- Additional optimization details are provided in Appendix A.
- At matched accuracy (~52 TPS for the autoregressive baseline), the method achieves 108.9 TPS at 93%+ accuracy (thr=0.95), a 2.1× speedup.

Sections: sec_3, sec_5, sec_12, sec_20
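The hybrid factorization above can be sketched as an attention mask: full (bidirectional) attention within each block, strictly block-causal attention across blocks. A minimal sketch, assuming the standard block-causal mask used by block diffusion models; `block_attention_mask` is a hypothetical helper, not the paper's implementation:

```python
import numpy as np

def block_attention_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Boolean mask for hybrid block attention.

    Entry [i, j] is True when query token i may attend to key token j:
    bidirectional attention inside a block, autoregressive (block-causal)
    attention across blocks.
    """
    seq_len = num_blocks * block_size
    block_id = np.arange(seq_len) // block_size  # block index of each token
    # A token in block b attends to every token in blocks 0..b.
    return block_id[:, None] >= block_id[None, :]
```

Under such a mask, previously decoded blocks can be KV-cached as in ordinary autoregressive inference, while diffusion refinement runs in parallel over the tokens of the current block only.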
Claim Verification
The paper fully describes the MinerU-Diffusion framework, including its architecture (block-attention diffusion) and training methodology (two-stage curriculum learning), and provides extensive experimental validation across multiple benchmarks (OmniDocBench, CC-OCR, OCRBench v2, UniMER-Test, and Semantic Shuffle).
The paper provides a clear mathematical formulation of document OCR as inverse rendering under visual conditioning (p_13-p_15), with the unified token sequence representation and posterior inference framework. The shift from AR to diffusion-based decoding is clearly motivated.
The block-attention mechanism is described in detail with mathematical formulation (p_19-p_24), including the attention mask definition. Ablation experiments comparing Block-Attn vs Full-Attn are reported in p_65, demonstrating the effectiveness of the block-attention design.
The two-stage curriculum learning framework is described in detail (p_27-p_47), with Stage I focusing on foundational representation learning and Stage II on uncertainty-driven refinement. Ablation studies are mentioned in p_66 to validate the framework.
The Semantic Shuffle benchmark construction is described in p_68, starting from 112 English document images from the FOX dataset, shuffling words while maintaining visual presentation. Results are shown in Figure 7, demonstrating the contrast between autoregressive models and MinerU-Diffusion on semantically shuffled inputs.
The formulation is clearly stated and mathematically defined in p_13-p_15. However, this is primarily a conceptual reframing rather than a novel technical contribution with independent validation.
The shared vocabulary V is stated in p_14 but no justification is provided for why this specific vocabulary composition (text symbols, layout markers, table delimiters, mathematical operators) was chosen, nor are there ablations comparing alternative
The block-wise factorization formula is provided in p_20-p_21 with clear mathematical notation. The design is evaluated through ablation experiments comparing Block-Attn vs Full-Attn in p_65.
This property is described in p_22 as a consequence of the hybrid factorization design. The paper explains how block boundaries serve as structural anchors while preserving parallel efficiency within blocks.
The use of native-scale visual features is stated in p_25 but no justification or ablation is provided comparing native-scale features against alternative visual feature extraction methods.
The dataset division into D_base and D_hard is described in p_28 as part of the curriculum learning framework. The methodology for identifying hard samples is detailed in p_34-p_40, and ablation studies are mentioned in p_66.
While D_base construction is described in p_29-p_32, the paper does not provide specific quantitative metrics for 'large-scale' (only mentions 7.5M total samples later), 'diverse', or 'balanced'. The properties are stated qualitatively without concrete quantitative support.
The task-specific consistency metrics are defined in p_36 but no justification is provided for why PageIoU, CDM, and TEDS were chosen over alternative metrics for measuring prediction consistency.
This is a factual specification stated in p_49 with a concrete number (approximately 7.5M samples) and clear source attribution (MinerU2.5 dataset).
The model choice (SDAR-1.7B-Chat-b32) and block size (32) are stated in p_50 but no justification or ablation is provided for why this specific model or block size was chosen over alternatives.
The LLaVA-NeXT fine-tuning step is mentioned in p_50 but no justification is provided for why this initialization strategy was chosen, nor are there ablations comparing different initialization approaches.
The decoding parameters are specified in p_51, and threshold sensitivity analysis is provided in Figures 4 and 5, showing the effects of varying the threshold parameter.
Quantitative results are provided across multiple benchmarks: OmniDocBench (Table 2), CC-OCR and OCRBench v2 for tables (Table 3), UniMER-Test for formulas (Table 3), and Semantic Shuffle (Figure 7). Performance is competitive with SOTA approaches, though a gap to the best specialized pipeline remains.
Specific numerical results (91.6/91.6/92.0/96.8 on CPE/HWE/SCE/SPE) are stated in p_58 and shown in Table 3 for UniMER-Test formula recognition benchmark.
Specific throughput numbers (108.9 TPS at thr=0.95, 164.8 TPS at thr=0.6) and speedup factors (2.1× and 3.2×) are provided in p_62 with reference to Figure 4.
... 39 claims verified in total
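The reported speedup factors are consistent with the throughput numbers; a quick sanity check, taking the ~52 TPS matched-accuracy autoregressive baseline and the two operating points cited above:

```python
# Reported throughputs (tokens per second) from the paper's Figure 4 discussion.
ar_baseline_tps = 52.0  # matched-accuracy autoregressive baseline (~52 TPS)
tps_thr_095 = 108.9     # MinerU-Diffusion throughput at threshold 0.95
tps_thr_060 = 164.8     # MinerU-Diffusion throughput at threshold 0.6

speedup_conservative = tps_thr_095 / ar_baseline_tps  # ~2.09, reported as 2.1x
speedup_aggressive = tps_thr_060 / ar_baseline_tps    # ~3.17, reported as 3.2x
```

Both ratios round to the claimed 2.1× and 3.2× figures.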
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available
- No training/evaluation data available or specified
- Critical hyperparameters missing: learning rate, batch size (training), number of epochs, optimizer, weight decay, learning rate schedule
- Random seeds not specified
- Appendix A referenced for 'additional optimization details' but not provided in the text
- Training hardware specifications not mentioned (only H200 GPU mentioned for inference throughput testing)
- Software environment details missing (PyTorch version, CUDA version, dependencies)
- Training data splits and dataset preparation details not specified
- Preprocessing steps for document OCR training not described
- Evaluation metrics implementation details not provided (TEDS, TEDS-S, CPE, HWE, SCE, SPE)
Limitations (author-stated)
- No dedicated evaluation was conducted for low-resource languages.
- The remaining gap to the best specialized pipeline (e.g., MinerU2.5 [29]) is most visible on harder categories like CPE, suggesting future improvements should focus on more precise symbol-level modeling and structure-aware decoding for complex printed expressions.
- Full self-attention incurs quadratic complexity O(L²) with respect to sequence length L, making it computationally expensive for long structured documents with thousands of tokens.
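The quadratic blow-up is easy to quantify against the claimed O(BL'²) within-block cost, using the paper's block size of 32. A back-of-envelope sketch only; cross-block attention through the KV cache adds a further term not counted here:

```python
seq_len, block = 4096, 32             # long structured document; b32 block size
num_blocks = seq_len // block         # 128 blocks

full_cost = seq_len ** 2              # O(L^2): attention score entries
block_cost = num_blocks * block ** 2  # O(B * L'^2): within-block entries only

ratio = full_cost // block_cost       # 128x fewer within-block score entries
```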
This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic review. The claim verification and reproducibility assessment are based on automated analysis of the paper text and may contain errors. Please refer to arXiv for the original paper.
Analysis time: 2026-03-28T17:34:24+00:00 · Data source: Paper Collector