TL;DR
MinerU-Diffusion reformulates document OCR as an inverse rendering problem solved by diffusion-based decoding over a block-attention architecture. It achieves a 2.1-3.2× speedup over autoregressive decoding while maintaining 90%+ accuracy and robustness against semantic distortion, overcoming the efficiency bottlenecks of autoregressive Vision-Language Models.
Verified: 26
Insufficient evidence: 11
Unverifiable: 2
Reproducibility: N/A
Confidence: 75%

Core Question

Can document OCR be reformulated as an inverse rendering problem using diffusion-based decoding to overcome efficiency bottlenecks and semantic hallucination issues inherent in autoregressive Vision-Language Models?

Core Method

Approach: The authors propose MinerU-Diffusion, a block-attention diffusion language model with a hybrid factorization: autoregressive structure across blocks and parallel diffusion refinement within blocks. A two-stage curriculum learning framework trains the model on the MinerU2.5 dataset (~7.5M samples), with diversity-driven foundational learning followed by uncertainty-driven boundary refinement using task-specific consistency metrics.

Key components:
- Full-attention diffusion suffers from quadratic complexity, positional instability, and unnecessary coupling of independent document regions.
- MinerU-Diffusion introduces a block-attention architecture with hybrid factorization: autoregressive across blocks, parallel diffusion within blocks.
- Block boundaries serve as structural anchors that prevent long-range alignment drift while preserving parallel efficiency within blocks.
- Structured attention masking reduces complexity from O(L²) to O(BL'²) and enables efficient KV-caching during inference.
- The architecture conditions on native-scale visual features to keep posterior refinement grounded in visual evidence.
- The model is a block-wise-attention dVLM built on SDAR-1.7B-Chat-b32 with a block size of 32.
- M-RoPE is removed from the MinerU2.5 architecture.
- Training proceeds in two stages: fine-tuning on LLaVA-NeXT for VQA, followed by specialized document OCR training.
- Additional optimization details are provided in Appendix A.
- Relative to an autoregressive baseline at matched accuracy (~52 TPS), the method achieves 108.9 TPS at 93%+ accuracy (thr=0.95), a 2.1× speedup.

Sections: sec_3, sec_5, sec_12, sec_20
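As a rough illustration of the structured attention masking described above (causal across blocks, bidirectional within a block), here is a minimal NumPy sketch; the function name and layout are illustrative, not the paper's implementation:

```python
import numpy as np

def block_attention_mask(num_blocks: int, block_size: int) -> np.ndarray:
    """Boolean mask: True where a query token may attend to a key token.

    Attention is causal at block granularity: every token sees all tokens
    in earlier blocks plus all tokens in its own block (bidirectional
    within the block). Processing blocks sequentially with a KV cache is
    what reduces the attention cost from O(L^2) to roughly O(B L'^2).
    """
    L = num_blocks * block_size
    q_block = np.arange(L) // block_size   # block index of each query token
    k_block = np.arange(L) // block_size   # block index of each key token
    # A query in block i attends to keys in blocks <= i.
    return q_block[:, None] >= k_block[None, :]

mask = block_attention_mask(num_blocks=3, block_size=2)
```

With three blocks of size two, tokens 0 and 1 attend to each other but not to later blocks, while the final token attends to the whole sequence.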

Claim Verification

Verified (85%) We propose MinerU-Diffusion, a unified diffusion-based parsing framework tailored for document OCR.
The paper fully describes the MinerU-Diffusion framework, including its architecture (block-attention diffusion) and training methodology (two-stage curriculum learning), and provides extensive experimental validation across multiple benchmarks (OmniDocBench and others).
Verified (80%) We formulate document OCR explicitly as an inverse rendering problem under visual conditioning, shifting from autoregressive causal decoding to diffusion-based decoding.
The paper provides a clear mathematical formulation of document OCR as inverse rendering under visual conditioning (p_13-p_15), with the unified token sequence representation and posterior inference framework. The shift from AR to diffusion-based decoding is explicitly stated.
Verified (85%) We introduce MinerU-Diffusion, a block-attention [2,4] dVLM that incorporates structural locality into posterior refinement.
The block-attention mechanism is described in detail with mathematical formulation (p_19-p_24), including the attention mask definition. Ablation experiments comparing Block-Attn vs Full-Attn are reported in p_65, demonstrating the effectiveness of incorporating structural locality.
Verified (80%) we propose a two-stage curriculum learning framework to train the MinerU-Diffusion
The two-stage curriculum learning framework is described in detail (p_27-p_47), with Stage I focusing on foundational representation learning and Stage II on uncertainty-driven refinement. Ablation studies are mentioned in p_66 to validate the framework.
Verified (85%) we construct Semantic Shuffle, a benchmark that removes semantic coherence while keeping the visual presentation comparable
The Semantic Shuffle benchmark construction is described in p_68, starting from 112 English document images from the FOX dataset and shuffling words while maintaining visual presentation. Results are shown in Figure 7, demonstrating the contrast between the original and shuffled settings.
Verified (75%) We model document OCR [32] as the inverse rendering of a unified structured token sequence
The formulation is clearly stated and mathematically defined in p_13-p_15. However, this is primarily a conceptual reframing rather than a novel technical contribution with independent validation.
Insufficient evidence (60%) V is a shared vocabulary encompassing text symbols, layout markers, table delimiters, and mathematical operators
The shared vocabulary V is stated in p_14, but no justification is provided for why this specific vocabulary composition (text symbols, layout markers, table delimiters, mathematical operators) was chosen, nor are there ablations comparing alternative vocabulary designs.
Verified (85%) we factorize the conditional posterior: [block-wise factorization formula]
The block-wise factorization formula is provided in p_20-p_21 with clear mathematical notation. The design is evaluated through ablation experiments comparing Block-Attn vs Full-Attn in p_65.
Verified (75%) This hybrid factorization introduces coarse-grained autoregressive structure across blocks and parallel diffusion refinement within blocks.
This property is described in p_22 as a consequence of the hybrid factorization design. The paper explains how block boundaries serve as structural anchors while preserving parallel efficiency within blocks.
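The block-wise factorization referenced above can plausibly be written in the standard block-diffusion form (a reconstruction from the surrounding description; the paper's exact notation may differ):

```latex
p_\theta(y \mid x) \;=\; \prod_{b=1}^{B} p_\theta\!\bigl(y_b \,\bigm|\, y_{<b},\, x\bigr)
```

where the token sequence y = (y_1, …, y_B) is split into B blocks of length L', x is the visual conditioning, the product over blocks is autoregressive, and each within-block conditional p_θ(y_b | y_{<b}, x) is modeled by parallel masked-diffusion refinement rather than left-to-right decoding.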
Insufficient evidence (50%) MinerU-Diffusion conditions the diffusion process on native-scale visual features [29,30,44,8]
The use of native-scale visual features is stated in p_25 but no justification or ablation is provided comparing native-scale features against alternative visual feature extraction methods.
Verified (80%) We divide the dataset D into two subsets: D_base, which is easier to train, and D_hard, which is more challenging.
The dataset division into D_base and D_hard is described in p_28 as part of the curriculum learning framework. The methodology for identifying hard samples is detailed in p_34-p_40, and ablation studies are mentioned in p_66.
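The uncertainty-driven split described above can be sketched as a small routing loop; `predict`, `consistency`, and the threshold value are illustrative stand-ins, not the paper's implementation:

```python
def split_curriculum(dataset, predict, consistency, threshold=0.8):
    """Hypothetical sketch of the D_base / D_hard split.

    Each sample is decoded twice; a task-specific consistency metric
    (PageIoU / CDM / TEDS in the paper) scores the agreement of the two
    predictions. Consistent samples are treated as easy (D_base);
    inconsistent ones expose model uncertainty and go to D_hard.
    """
    d_base, d_hard = [], []
    for sample in dataset:
        a, b = predict(sample), predict(sample)
        bucket = d_base if consistency(a, b) >= threshold else d_hard
        bucket.append(sample)
    return d_base, d_hard
```

A sample whose two decodings agree is routed to D_base; one whose decodings diverge is routed to D_hard for Stage II refinement.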
Insufficient evidence (55%) we construct a large-scale, diverse, and balanced dataset D_base through data curation and automated annotation refinement
While the D_base construction is described in p_29-p_32, the paper does not provide specific quantitative metrics for "large-scale" (only the 7.5M total samples mentioned later), "diverse", or "balanced". These properties are stated qualitatively without concrete measurements.
Insufficient evidence (50%) We define a task-specific consistency metric S(•, •): (1) PageIoU for layout, (2) CDM for formula, (3) TEDS for tables.
The task-specific consistency metrics are defined in p_36 but no justification is provided for why PageIoU, CDM, and TEDS were chosen over alternative metrics for measuring prediction consistency.
Verified (90%) All meta training data are derived from the MinerU2.5 dataset [29], with a total volume of approximately 7.5M samples.
This is a factual specification stated in p_49 with a concrete number (approximately 7.5M samples) and clear source attribution (MinerU2.5 dataset).
Insufficient evidence (50%) we employ the SDAR-1.7B-Chat-b32 [4] with a block size of 32
The model choice (SDAR-1.7B-Chat-b32) and block size (32) are stated in p_50 but no justification or ablation is provided for why this specific model or block size was chosen over alternatives.
Insufficient evidence (50%) We first fine-tune the MinerU-Diffusion on the LLaVA-NeXT dataset [19] for visual question answering (VQA) tasks.
The LLaVA-NeXT fine-tuning step is mentioned in p_50 but no justification is provided for why this initialization strategy was chosen, nor are there ablations comparing different initialization approaches.
Verified (80%) The decoding threshold is set to T = 0.95, with top-k = 0, temperature = 1.0, and top-p = 1.0.
The decoding parameters are specified in p_51, and threshold sensitivity analysis is provided in Figures 4 and 5, showing the effects of varying the threshold parameter.
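The threshold-based decoding behind the sensitivity analysis above can be sketched as a toy refinement loop; `propose` stands in for the model's per-position prediction, and the function itself is illustrative, not the paper's implementation:

```python
def threshold_decode(block_len, propose, threshold=0.95, max_steps=8):
    """Toy sketch of confidence-thresholded parallel refinement in a block.

    Each step, (token, confidence) is proposed for every still-masked
    position in parallel; positions whose confidence clears `threshold`
    are committed. A lower threshold commits more tokens per step
    (higher throughput) at some accuracy risk -- the thr=0.95 vs thr=0.6
    trade-off reported in the claims above.
    """
    tokens = [None] * block_len
    steps = 0
    while None in tokens:
        steps += 1
        force = steps >= max_steps          # guarantee termination
        for i, tok in enumerate(tokens):
            if tok is None:
                candidate, conf = propose(i)
                if conf >= threshold or force:
                    tokens[i] = candidate
    return tokens, steps

# Positions with fixed confidences: two clear 0.95 easily, two do not.
confs = [0.99, 0.70, 0.99, 0.80]
propose = lambda i: (f"t{i}", confs[i])
_, strict_steps = threshold_decode(4, propose, threshold=0.95, max_steps=2)
_, loose_steps = threshold_decode(4, propose, threshold=0.60)
```

With the strict threshold the low-confidence positions need an extra (forced) step; with the loose threshold all four positions commit in a single parallel step, which is the mechanism behind the higher TPS at thr=0.6.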
Verified (75%) MinerU-Diffusion achieves performance on par with state-of-the-art approaches across multiple challenging document parsing benchmarks and semantic perturbation settings
Quantitative results are provided across multiple benchmarks: OmniDocBench (Table 2), CC-OCR and OCRBench v2 for tables (Table 3), UniMER-Test for formulas (Table 3), and Semantic Shuffle (Figure 7). Performance is competitive with SOTA approaches.
Verified (90%) MinerU-Diffusion achieves 91.6/91.6/92.0/96.8 on CPE/HWE/SCE/SPE, consistently demonstrating strong performance across complex, handwritten, and printed settings.
Specific numerical results (91.6/91.6/92.0/96.8 on CPE/HWE/SCE/SPE) are stated in p_58 and shown in Table 3 for UniMER-Test formula recognition benchmark.
Verified (85%) our method achieves 108.9 TPS at 93%+ accuracy (thr = 0.95), corresponding to a 2.1× speedup, and reaches 164.8 TPS while maintaining over 90% accuracy (thr = 0.6), yielding a peak acceleration of approximately 3.2×
Specific throughput numbers (108.9 TPS at thr=0.95, 164.8 TPS at thr=0.6) and speedup factors (2.1× and 3.2×) are provided in p_62 with reference to Figure 4.

… 39 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing reproduction details

Limitations (author-stated)

This analysis was generated automatically by PDF Reading Assistant and is for reference only; it does not constitute an academic review. The claim verification and reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.

Analysis time: 2026-03-28T17:34:24+00:00 · Data source: Paper Collector