MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale - AI Paper In-Depth Analysis

TL;DR
MinerU2.5-Pro argues that document parsing bottlenecks stem from data quality rather than architecture. With the same 1.2B-parameter model, it scores 95.69 on OmniDocBench v1.6 (+2.71 over the 92.98 baseline) by using a Data Engine that expands training data from under 10M to 65.5M pages, and it outperforms models with 200× more parameters.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

79%

Core Question

Can document parsing performance be significantly improved through systematic data engineering alone, without any architectural modifications to the model?

Core Method

Approach: The authors retain MinerU2.5's 1.2B-parameter architecture unchanged and introduce a Data Engine with three components: DDAS for diversity-and-difficulty-aware sampling, CMCV for cross-model consistency verification using three heterogeneous models, and a Judge-and-Refine pipeline for hard sample annotation. Training follows a three-stage progressive strategy: large-scale pre-training, hard sample fine-tuning, and GRPO reinforcement learning for format alignment.

Key components:
Pipeline-based methods enable independent component optimization but suffer from error propagation and inter-module information loss.
End-to-end VLM methods avoid cascading errors but native-resolution processing incurs O(N²) token complexity.
Decoupled VLM methods like MinerU2.5 combine layout analysis and content recognition, balancing controllability and semantic modeling.
General-purpose VLMs achieve competitive results but their large parameter scales hinder cost-effective deployment.
Methodological evolution has focused on architecture design, while systematic data engineering remains underexplored.
CMCV addresses limitations of single-model uncertainty estimation by using multi-model cross-validation to distinguish model-specific blind spots from universally hard problems.
Three difficulty tiers are defined: Easy (model consensus indicates reliable results), Medium (external consensus provides pseudo-labels for MinerU2.5's gaps), and Hard (no reliable annotation through consensus).
Medium data is prioritized in DDAS sampling due to its high training value and reliable annotations from external model consensus.
Task-specific pairwise consistency metrics are used: edit distance for text, TEDS for tables, and CDM for formulas.

Section IDs: sec_4, sec_9
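
The Easy/Medium/Hard tier definitions above can be read as a small decision rule over pairwise consistency scores. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function name `classify_tier`, its argument names, and the single `threshold=0.9` cutoff are all assumptions, since the analysis notes that the actual thresholds are not specified.

```python
def classify_tier(sim_target_ext1, sim_target_ext2, sim_ext1_ext2, threshold=0.9):
    """Assign a difficulty tier from three pairwise consistency scores in [0, 1].

    "target" stands for the model being improved and "ext1"/"ext2" for the two
    external models. The threshold value is a placeholder assumption.
    """
    externals_agree = sim_ext1_ext2 >= threshold
    target_agrees = sim_target_ext1 >= threshold and sim_target_ext2 >= threshold
    if externals_agree and target_agrees:
        return "easy"    # all three models reach consensus
    if externals_agree:
        return "medium"  # external consensus can act as a pseudo-label
    return "hard"        # no external consensus, so no reliable automatic label


print(classify_tier(0.97, 0.95, 0.96))
print(classify_tier(0.40, 0.45, 0.96))
print(classify_tier(0.40, 0.95, 0.50))
```

In this reading, a sample where the target agrees with only one external model still falls into the hard tier, because the two external models do not agree with each other.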

Claim Verification

Verified (85%) MinerU2.5-Pro improves the overall score from 92.98 to 95.69 purely through data engineering and training strategy design, outperforming both specialized document parsing models (e.g. GLM-OCR, PaddleOCR-VL-1.5, Youtu-Parsing) and general-purpose VLMs (e.g. Gemini 3 Pro, Qwen3-VL-235B).
The paper provides specific quantitative results: overall score improved from 92.98 to 95.69 (+2.71). Tables 3-5 and p_70-74 show detailed comparisons with competing models. The claim about outperforming specialized models (GLM-OCR, PaddleOCR-VL-1.5)

Verified (95%) Built upon MinerU2.5 with its 1.2B-parameter architecture entirely unchanged.
The architecture is explicitly specified in multiple locations: 1.2B-parameter decoupled coarse-to-fine architecture with NaViT-675M vision encoder + Qwen2-0.5B language model. The paper repeatedly emphasizes 'entirely unchanged' and 'without any str

Insufficient evidence (50%) These models exhibit highly similar failure modes on the same hard samples, with certain parsing errors common to all tested systems.
The paper claims 'highly similar failure modes on the same hard samples' based on cross-analysis of multiple SOTA models, but provides no quantitative data measuring failure mode similarity. No statistics on error overlap rates, correlation of failur

Insufficient evidence (45%) The current performance bottleneck in document parsing stems primarily from shared deficiencies in training data, not from model architecture itself.
This is a causal hypothesis about the source of performance bottlenecks. While the paper provides supporting evidence (similar failure modes across architectures, performance gains from data engineering alone), it does not provide a controlled experi

Verified (80%) MinerU2.5's training data totals less than 10M pages with distributions concentrated on high-frequency categories, severely underrepresenting long-tail scenarios such as complex nested tables and dense formula layouts.
The paper provides specific numbers: 'less than 10M pages' for MinerU2.5 training data, expanded to 65.5M pages in the new system. The claim about distribution concentration and underrepresentation of long-tail scenarios is stated but not quantified

Insufficient evidence (50%) The hard samples that contribute most to model improvement are precisely those for which automatic annotation is least reliable, since no mainstream model can consistently parse them correctly.
This is a logical argument about the relationship between sample difficulty and annotation reliability, but no quantitative evidence is provided. The paper doesn't present experiments measuring annotation error rates across difficulty tiers, or demon

Insufficient evidence (55%) OmniDocBench v1.5 contains relatively few hard samples, and its element-matching logic exhibits systematic biases toward specific output formats, introducing scoring artifacts that complicate fair cross-system comparison.
The paper claims OmniDocBench v1.5 has 'relatively few hard samples' and 'systematic biases' but provides no quantitative analysis of v1.5's hard sample coverage or systematic measurement of matching biases. Figure 4 is referenced to illustrate the m

Verified (90%) We introduce OmniDocBench v1.6, which corrects these matching biases and incorporates a dedicated Hard subset to establish a Base/Hard/Full three-tier evaluation protocol.
The contribution is clearly described and implemented: OmniDocBench v1.6 with Multi-Granularity Adaptive Matching (MGAM) and a Hard subset. The three-tier protocol (Base/Hard/Full) is specified with exact page counts (1,355/296/1,651 pages). The MGAM

Verified (90%) We build MinerU2.5-Pro, retaining the identical 1.2B-parameter decoupled coarse-to-fine architecture of MinerU2.5 and focusing all optimization on the Data Engine and training strategy, ensuring that all performance gains are attributable to data-level improvements.
This design choice is explicitly stated and implemented. The paper clearly specifies keeping the 1.2B-parameter architecture identical and focusing optimization on Data Engine and training strategy. The methodology ensures attribution of gains to dat

Verified (85%) On OmniDocBench v1.6, MinerU2.5-Pro achieves 95.69 (baseline 92.98, +2.71), surpassing all existing methods, including models with over 200× more parameters.
Specific quantitative results are provided: 95.69 overall score, +2.71 improvement over baseline 92.98. The claim about surpassing models with 200× more parameters refers to comparisons with models like Qwen3-VL-235B (235B parameters vs 1.2B = ~196×)
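
The two figures in this claim are easy to check by hand. The snippet below is only an arithmetic check on the numbers quoted above; the 235B figure is taken from the model name Qwen3-VL-235B, and no other comparison model's size is assumed.

```python
baseline_score = 92.98   # MinerU2.5 on OmniDocBench v1.6, as quoted above
pro_score = 95.69        # MinerU2.5-Pro on OmniDocBench v1.6, as quoted above
gain = round(pro_score - baseline_score, 2)

# Parameter ratio between a 235B-parameter model and the 1.2B-parameter model.
param_ratio = 235e9 / 1.2e9

print(gain, round(param_ratio, 1))
```

The score gain matches the reported +2.71. The ratio against a 235B-parameter model comes out just under 200×, so the "over 200×" wording presumably refers to a larger comparison model or to rounding; the excerpt does not say which.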

Verified (85%) A Data Engine co-designed around coverage, informativeness, and annotation accuracy. It comprises three core components (Diversity-and-Difficulty-Aware Sampling (DDAS), Cross-Model Consistency Verification (CMCV), and a Judge-and-Refine annotation pipeline) that together expand training data from under 10M to 65.5M pages while systematically improving annotation quality through a closed-loop progression from sampling to refinement.
The Data Engine is fully specified with three components (DDAS, CMCV, Judge-and-Refine) detailed in Sections 3.1-3.3. The data expansion from under 10M to 65.5M pages is quantified. Each component's role in the closed-loop progression is described.

Verified (80%) A three-stage progressive training strategy (large-scale pre-training, high-quality hard sample fine-tuning, and GRPO format alignment) matched to the data quality tiers produced by the Data Engine.
The three-stage training strategy is described: large-scale pre-training, high-quality hard sample fine-tuning, and GRPO format alignment. The matching to data quality tiers (Easy/Medium/Hard from CMCV) is explained. However, detailed training config
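
For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other outputs drawn for the same prompt. The sketch below shows that generic group-relative advantage computation only. The reward values are hypothetical, the paper's actual format reward is not described in this analysis, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled output's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical format rewards for four outputs sampled from one prompt:
# 1.0 if the output is well-formed, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that beat their group's mean reward receive a positive advantage and the rest a negative one; when every output in a group earns the same reward, all advantages are zero and that group contributes no learning signal.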

Verified (90%) OmniDocBench v1.6, an upgraded evaluation protocol that corrects element-matching biases in v1.5 through Multi-Granularity Adaptive Matching and introduces a Hard subset, establishing a Base/Hard/Full three-tier framework for fairer and more discriminative evaluation.
OmniDocBench v1.6 is fully specified with MGAM algorithm (p_54-64) and Hard subset construction (p_65). The three-tier framework with specific page counts is provided. The contribution is concrete and reproducible.

Verified (75%) Our work treats data construction for document parsing as a standalone systematic research problem, co-optimizing coverage, informativeness, and annotation accuracy within a unified framework.
The paper explicitly frames data construction as a standalone research problem and provides a unified framework co-optimizing coverage, informativeness, and annotation accuracy. The contribution is methodological rather than empirical, and is well-ar

Verified (80%) Methodologically, our CMCV approach draws on the core principles of ensemble-based active learning and query-by-committee by leveraging multi-model disagreement to quantify sample informativeness.
The methodological connection to ensemble-based active learning and query-by-committee is explicitly stated and appropriate. CMCV uses multi-model disagreement for sample informativeness, which aligns with these established principles.
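
As background on the query-by-committee idea referenced here, a classical disagreement measure is vote entropy over the committee's discrete predictions. The sketch below illustrates that textbook measure only; it is not CMCV's metric, which per the analysis relies on edit distance, TEDS, and CDM, and the example labels are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's vote distribution; higher means more disagreement."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical element-type votes from a three-model committee.
unanimous = vote_entropy(["table", "table", "table"])
split = vote_entropy(["table", "figure", "text"])
print(unanimous, split)
```

A unanimous committee yields zero entropy, while a three-way split yields the maximum for three voters, which is the sense in which disagreement serves as a proxy for informativeness.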

Insufficient evidence (45%) The critical impact of element-matching strategies on evaluation fairness remains largely overlooked.
The paper claims element-matching strategies' impact on evaluation fairness is 'largely overlooked' but provides no literature review or citation analysis demonstrating this gap. This is a claim about the state of the field that requires surveying ex

Verified (85%) We propose Diversity-and-Difficulty-Aware Sampling (DDAS), which jointly optimizes diversity and difficulty at both page and element granularity.
DDAS is fully specified in p_21-27 with two-stage operation (page-level and element-level sampling) that jointly optimizes diversity (via clustering) and difficulty (via CMCV). The algorithm is concrete and reproducible.
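
One way to picture sampling that balances diversity (via clusters) and difficulty (via CMCV tiers) is to draw a fixed budget from every cluster and weight the draw inside each cluster by tier. The sketch below is an illustration under assumptions, not the paper's algorithm: the `cluster` and `tier` fields, the `TIER_WEIGHT` values, and the equal per-cluster budget are all hypothetical. It uses the Efraimidis-Spirakis key trick for weighted sampling without replacement.

```python
import random
from collections import defaultdict

# Hypothetical weights; Medium is favored, as the analysis says DDAS prioritizes it.
TIER_WEIGHT = {"easy": 1.0, "medium": 3.0, "hard": 2.0}


def sample_pages(pages, per_cluster, seed=0):
    """Pick up to `per_cluster` pages from each cluster, weighted by tier."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for page in pages:
        by_cluster[page["cluster"]].append(page)

    selected = []
    for cluster_id in sorted(by_cluster):
        members = by_cluster[cluster_id]
        # Efraimidis-Spirakis: key = u ** (1 / weight); keep the largest keys.
        keyed = [
            (rng.random() ** (1.0 / TIER_WEIGHT[p["tier"]]), p) for p in members
        ]
        keyed.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(p for _, p in keyed[:per_cluster])
    return selected


# Toy candidate pool: 3 clusters with 5 pages each, tiers assigned cyclically.
tiers = ["easy", "medium", "hard"]
pool = [
    {"id": f"c{c}-p{i}", "cluster": c, "tier": tiers[i % 3]}
    for c in range(3)
    for i in range(5)
]
picked = sample_pages(pool, per_cluster=2)
print([p["id"] for p in picked])
```

Drawing per cluster keeps rare layout types from being crowded out by frequent ones, while the tier weights tilt each cluster's draw toward the samples considered most useful for training.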

Verified (90%) Central to DDAS is Cross-Model Consistency Verification (CMCV), which leverages prediction agreement among heterogeneous models to classify samples into Easy/Medium/Hard difficulty tiers.
The design choice is clearly specified: CMCV uses prediction agreement among heterogeneous models to classify samples into Easy/Medium/Hard tiers. The specific models and metrics are defined in p_29-32.

Verified (85%) We propose Cross-Model Consistency Verification (CMCV), which extends difficulty assessment from single-model introspection to multi-model cross-validation.
The contribution is clearly described: extending from single-model introspection (IMIC, UACS) to multi-model cross-validation. The contrast with prior approaches is explained in p_28, and the CMCV methodology is detailed in p_29-34.

Verified (90%) We run three heterogeneous document parsing models (MinerU2.5, PaddleOCR-VL, Qwen3-VL-30B) independently on the candidate data produced by DDAS, compute task-specific pairwise consistency metrics (text: edit distance; table: TEDS; formula: CDM), and classify each sample into three difficulty tiers based on consistency patterns.
The specific models (MinerU2.5, PaddleOCR-VL, Qwen3-VL-30B), metrics (edit distance, TEDS, CDM), and classification approach are all explicitly specified. This is a concrete, reproducible design choice.
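
For the text case, pairwise consistency can be illustrated with a normalized edit-distance similarity computed for every pair of model outputs. The sketch below covers text only (TEDS and CDM are not implemented), and the normalization by the longer string, the function names, and the sample outputs are assumptions for illustration rather than the paper's exact definition.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1.0 for identical strings, 0.0 when every character differs."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def pairwise_consistency(outputs):
    """Map each unordered model pair to the similarity of their outputs."""
    return {
        (m1, m2): edit_similarity(outputs[m1], outputs[m2])
        for m1, m2 in combinations(sorted(outputs), 2)
    }


# Hypothetical outputs from the three models on one text block.
outputs = {
    "MinerU2.5": "Total revenue: 1,204",
    "PaddleOCR-VL": "Total revenue: 1,284",
    "Qwen3-VL-30B": "Total revenue: 1,284",
}
scores = pairwise_consistency(outputs)
print(scores)
```

In this toy case the two external models agree exactly while the first model differs by one character, which is the kind of pattern the tier classification is described as exploiting.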

... 61 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

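Code unavailable: no code repository or implementation is provided.
Data unavailable: neither the training data nor the evaluation data has been released.
Hyperparameters missing: key settings such as learning rate, batch size, number of epochs, and optimizer configuration are not provided.
Model architecture details missing: the 1.2B-parameter model is mentioned, but a detailed architecture specification is lacking.
DDAS implementation details missing: the concrete algorithm and parameters of Diversity-and-Difficulty-Aware Sampling are not provided.
CMCV thresholds undefined: the specific cutoffs (e.g. consistency score thresholds) separating the Easy/Medium/Hard tiers are not stated.
Judge-and-Refine details missing: the correction mechanism mentioned in Section 3.3 is not described in detail in the excerpt.
Training data scale and composition: the sources, scale, and distribution ratios of the training data are not specified.
Hardware specification: no information is given on compute resources such as GPU model, memory, or training time.
Random seeds: no seed settings for experimental repeatability are mentioned.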

Limitations (as stated by the authors)

OmniDocBench v1.6 improves scoring fairness through corrected matching strategies, but the element-matching paradigm itself has inherent limitations.
The ambiguity is twofold: at the format level, the same content can be expressed in multiple equivalent notations (e.g. HTML vs. Markdown for tables, different LaTeX commands for the same formula); at the structural level, the same visual layout can be legitimately represented with different element types.
Developing semantic-equivalence-aware evaluation methods that account for both format and structural ambiguity remains an open problem.
For vertical domains with higher precision requirements (e.g. finance, legal, medical), constructing domain-specific evaluation sets is a necessary complement.
As model capabilities approach human-level performance, ensuring the precision of evaluation set annotations themselves becomes an increasingly pressing challenge.

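This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic peer review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.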

Analysis time: 2026-04-10T07:11:52+00:00 · Data source: Paper Collector