MinerU2.5-Pro argues that document parsing bottlenecks stem from data quality rather than architecture. With the same 1.2B-parameter model left unchanged, it scores 95.69 on OmniDocBench v1.6 (+2.71 over the 92.98 baseline) by using a Data Engine that expands training data from under 10M to 65.5M pages, outperforming models with roughly 200× more parameters.
Core Question
Can document parsing performance be significantly improved through systematic data engineering alone, without any architectural modifications to the model?
Core Method
Approach: The authors retain MinerU2.5's 1.2B-parameter architecture unchanged and introduce a Data Engine with three components: DDAS for diversity-and-difficulty-aware sampling, CMCV for cross-model consistency verification using three heterogeneous models, and a Judge-and-Refine pipeline for hard sample annotation. Training follows a three-stage progressive strategy: large-scale pre-training, hard sample fine-tuning, and GRPO reinforcement learning for format alignment.
Key components:
- Pipeline-based methods enable independent component optimization but suffer from error propagation and inter-module information loss.
- End-to-end VLM methods avoid cascading errors, but native-resolution processing incurs O(N²) token complexity.
- Decoupled VLM methods like MinerU2.5 combine layout analysis and content recognition, balancing controllability and semantic modeling.
- General-purpose VLMs achieve competitive results, but their large parameter scales hinder cost-effective deployment.
- Methodological evolution has focused on architecture design, while systematic data engineering remains underexplored.
- CMCV addresses limitations of single-model uncertainty estimation by using multi-model cross-validation to distinguish model-specific blind spots from universally hard problems.
- Three difficulty tiers are defined: Easy (model consensus indicates reliable results), Medium (external consensus provides pseudo-labels for MinerU2.5's gaps), and Hard (no reliable annotation through consensus).
- Medium data is prioritized in DDAS sampling due to its high training value and reliable annotations from external model consensus.
- Task-specific pairwise consistency metrics are used: edit distance for text, TEDS for tables, and CDM for formulas.
Section IDs: sec_4, sec_9
Claim Verification
The paper provides specific quantitative results: overall score improved from 92.98 to 95.69 (+2.71). Tables 3-5 and p_70-74 show detailed comparisons with competing models. The claim about outperforming specialized models (GLM-OCR, PaddleOCR-VL-1.5)
The architecture is explicitly specified in multiple locations: 1.2B-parameter decoupled coarse-to-fine architecture with NaViT-675M vision encoder + Qwen2-0.5B language model. The paper repeatedly emphasizes 'entirely unchanged' and 'without any str
The paper claims 'highly similar failure modes on the same hard samples' based on cross-analysis of multiple SOTA models, but provides no quantitative data measuring failure mode similarity. No statistics on error overlap rates, correlation of failur
This is a causal hypothesis about the source of performance bottlenecks. While the paper provides supporting evidence (similar failure modes across architectures, performance gains from data engineering alone), it does not provide a controlled experi
The paper provides specific numbers: 'less than 10M pages' for MinerU2.5 training data, expanded to 65.5M pages in the new system. The claim about distribution concentration and underrepresentation of long-tail scenarios is stated but not quantified
This is a logical argument about the relationship between sample difficulty and annotation reliability, but no quantitative evidence is provided. The paper doesn't present experiments measuring annotation error rates across difficulty tiers, or demon
The paper claims OmniDocBench v1.5 has 'relatively few hard samples' and 'systematic biases' but provides no quantitative analysis of v1.5's hard sample coverage or systematic measurement of matching biases. Figure 4 is referenced to illustrate the m
The contribution is clearly described and implemented: OmniDocBench v1.6 with Multi-Granularity Adaptive Matching (MGAM) and a Hard subset. The three-tier protocol (Base/Hard/Full) is specified with exact page counts (1,355/296/1,651 pages). The MGAM
This design choice is explicitly stated and implemented. The paper clearly specifies keeping the 1.2B-parameter architecture identical and focusing optimization on Data Engine and training strategy. The methodology ensures attribution of gains to dat
Specific quantitative results are provided: 95.69 overall score, +2.71 improvement over baseline 92.98. The claim about surpassing models with 200× more parameters refers to comparisons with models like Qwen3-VL-235B (235B parameters vs 1.2B = ~196×)
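The headline figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear in this report (the two overall scores, the two parameter counts, and the Base/Hard/Full page counts); the variable names are illustrative.

```python
# Cross-check of figures quoted in this report; variable names are illustrative.
baseline_score, new_score = 92.98, 95.69
large_params, small_params = 235e9, 1.2e9
base_pages, hard_pages, full_pages = 1355, 296, 1651

score_gain = round(new_score - baseline_score, 2)   # reported as +2.71
param_ratio = large_params / small_params            # reported as ~196x ("200x")
pages_add_up = base_pages + hard_pages == full_pages

print(f"score gain: {score_gain}")
print(f"parameter ratio: {param_ratio:.1f}x")
print(f"Base + Hard == Full: {pages_add_up}")
```

The ratio comes out slightly under 196, so the "200×" wording is a rounded figure.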
The Data Engine is fully specified with three components (DDAS, CMCV, Judge-and-Refine) detailed in Sections 3.1-3.3. The data expansion from under 10M to 65.5M pages is quantified. Each component's role in the closed-loop progression is described.
The three-stage training strategy is described: large-scale pre-training, high-quality hard sample fine-tuning, and GRPO format alignment. The matching to data quality tiers (Easy/Medium/Hard from CMCV) is explained. However, detailed training config
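For readers unfamiliar with GRPO, its core step is a group-relative advantage: several outputs are sampled for the same input, each is scored, and each reward is normalized against the group's mean and standard deviation. The sketch below shows only that normalization in its commonly published form. The reward values are hypothetical inputs, and whether MinerU2.5-Pro uses exactly this variant or how it scores format compliance is not stated in this report.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a non-empty group of rewards sampled for one input.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled parses of the same page:
# 1.0 = output respects the expected format, 0.0 = it does not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every output in a group receives the same reward, all advantages are zero and that group contributes no learning signal.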
OmniDocBench v1.6 is fully specified with MGAM algorithm (p_54-64) and Hard subset construction (p_65). The three-tier framework with specific page counts is provided. The contribution is concrete and reproducible.
The paper explicitly frames data construction as a standalone research problem and provides a unified framework co-optimizing coverage, informativeness, and annotation accuracy. The contribution is methodological rather than empirical, and is well-ar
The methodological connection to ensemble-based active learning and query-by-committee is explicitly stated and appropriate. CMCV uses multi-model disagreement for sample informativeness, which aligns with these established principles.
The paper claims element-matching strategies' impact on evaluation fairness is 'largely overlooked' but provides no literature review or citation analysis demonstrating this gap. This is a claim about the state of the field that requires surveying ex
DDAS is fully specified in p_21-27 with two-stage operation (page-level and element-level sampling) that jointly optimizes diversity (via clustering) and difficulty (via CMCV). The algorithm is concrete and reproducible.
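A minimal sketch of the page-level idea, assuming cluster ids and CMCV tiers are already attached to each page: an equal quota per cluster stands in for diversity, and tier-weighted sampling within each cluster stands in for difficulty. The function name, the tier weights, and the equal-quota rule are all assumptions made for illustration. This is not the authors' algorithm, and the element-level stage is not covered.

```python
import random
from collections import defaultdict

# Hypothetical tier weights: Medium is favored, as the method summary describes.
TIER_WEIGHT = {"easy": 1.0, "medium": 4.0, "hard": 1.0}


def sample_pages(pages, budget, seed=0):
    """Pick up to `budget` page ids, spread evenly over clusters.

    pages: iterable of (page_id, cluster_id, tier) tuples.
    """
    by_cluster = defaultdict(list)
    for page_id, cluster_id, tier in pages:
        by_cluster[cluster_id].append((page_id, tier))
    if not by_cluster or budget <= 0:
        return []

    rng = random.Random(seed)
    per_cluster = max(1, budget // len(by_cluster))  # diversity: equal quota
    chosen = []
    for cluster_id in sorted(by_cluster):
        # Difficulty: weighted sampling without replacement via random keys
        # u ** (1 / w); a larger weight tends to produce a larger key.
        keyed = [
            (rng.random() ** (1.0 / TIER_WEIGHT[tier]), page_id)
            for page_id, tier in by_cluster[cluster_id]
        ]
        keyed.sort(reverse=True)
        chosen.extend(page_id for _, page_id in keyed[:per_cluster])
    return chosen[:budget]


pages = [
    ("a1", 0, "easy"), ("a2", 0, "medium"), ("a3", 0, "hard"),
    ("b1", 1, "easy"), ("b2", 1, "medium"), ("b3", 1, "medium"),
]
print(sample_pages(pages, budget=4))
```

With a fixed seed the selection is repeatable. If the budget is smaller than the number of clusters, this sketch simply truncates the selection, which favors lower cluster ids.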
The design choice is clearly specified: CMCV uses prediction agreement among heterogeneous models to classify samples into Easy/Medium/Hard tiers. The specific models and metrics are defined in p_29-32.
The contribution is clearly described: extending from single-model introspection (IMIC, UACS) to multi-model cross-validation. The contrast with prior approaches is explained in p_28, and the CMCV methodology is detailed in p_29-34.
The specific models (MinerU2.5, PaddleOCR-VL, Qwen3-VL-30B), metrics (edit distance, TEDS, CDM), and classification approach are all explicitly specified. This is a concrete, reproducible design choice.
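The tiering logic can be sketched for text elements as follows, assuming a normalized edit-distance similarity and a single agreement threshold. The 0.9 threshold, the function names, and the exact decision rule are assumptions, since this report notes that the actual cutoffs are not given. TEDS for tables and CDM for formulas would take the place of `similarity` for those element types and are not implemented here.

```python
from itertools import combinations

AGREE = 0.9  # hypothetical agreement threshold, not taken from the paper


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def classify_tier(own: str, external: list[str], agree: float = AGREE) -> str:
    """Assumed rule: tier one text element by cross-model agreement."""
    external_consistent = all(
        similarity(a, b) >= agree for a, b in combinations(external, 2)
    )
    if not external_consistent:
        return "hard"  # no consensus that could serve as a label
    own_matches = all(similarity(own, e) >= agree for e in external)
    # External consensus exists: it either confirms the target model (easy)
    # or supplies a pseudo-label for its gap (medium).
    return "easy" if own_matches else "medium"


print(classify_tier("Total revenue", ["Total revenue", "Total revenue"]))
print(classify_tier("xyz", ["Total revenue", "Total revenue"]))
print(classify_tier("Total revenue", ["Total revenue", "Net income"]))
```

Here `own` plays the role of the MinerU2.5 prediction and `external` the predictions of the other two models.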
... 61 claims in total
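Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Code unavailable: no code repository or implementation is provided
- Data unavailable: neither the training data nor the evaluation data has been released
- Hyperparameters missing: key settings such as learning rate, batch size, number of training epochs, and optimizer configuration are not provided
- Model architecture details missing: the 1.2B-parameter model is mentioned, but a detailed architecture specification is lacking
- DDAS implementation details missing: the concrete algorithm and parameters for diversity-and-difficulty-aware sampling are not provided
- CMCV thresholds undefined: the concrete cutoffs for the Easy/Medium/Hard difficulty tiers (e.g. consistency-score thresholds) are not stated
- Judge-and-Refine details missing: the correction mechanism mentioned in Section 3.3 is not detailed in the excerpt
- Training data scale and composition: the sources, scale, and distribution ratios of the training data are not stated
- Hardware specification: GPU model, memory, training time, and other compute details are not provided
- Random seeds: no seed settings for experiment repeatability are mentioned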
Limitations (as stated by the authors)
- OmniDocBench v1.6 improves scoring fairness through corrected matching strategies, but the element-matching paradigm itself has inherent limitations.
- The ambiguity is twofold: at the format level, the same content can be expressed in multiple equivalent notations (e.g. HTML vs. Markdown for tables, different LaTeX commands for the same formula); at the structural level, the same visual layout can be legitimately represented with different element types.
- Developing semantic-equivalence-aware evaluation methods that account for both format and structural ambiguity remains an open problem.
- For vertical domains with higher precision requirements (e.g. finance, legal, medical), constructing domain-specific evaluation sets is a necessary complement.
- As model capabilities approach human-level performance, ensuring the precision of evaluation set annotations themselves becomes an increasingly pressing challenge.
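This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.
Analysis time: 2026-04-10T07:11:52+00:00 · Data source: Paper Collector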