CiteVQA introduces a benchmark for evaluating evidence attribution in document VQA across 1,897 questions from 711 PDFs. Testing 20 MLLMs reveals "Attribution Hallucination"—correct answers with wrong evidence citations. State-of-the-art models achieve only 76.
Core Question
Can current Document Visual Question Answering models faithfully attribute their answers to correct visual evidence locations in documents, rather than just producing accurate textual responses?
Core Method
Approach: The authors constructed the CiteVQA benchmark with 1,897 questions from 711 PDFs using an automated annotation pipeline that generates QA pairs with element-level bounding-box citations. They evaluated 20 state-of-the-art MLLMs using Strict Attributed Accuracy (SAA), a metric requiring both correct answers and faithful evidence attribution verified through IoU thresholds and LLM-based relevance scoring.
Key components:
- Closed-source MLLMs dominate the benchmark, with Gemini-3.1-Pro-Preview leading at 76.0 Overall SAA.
- GPT-5.4 excels in answer correctness (87.1) but is surpassed by Gemini models in SAA.
- Open-source models show a significant performance cliff, with the strongest achieving only 22.5 SAA.
- Small-scale MLLMs struggle most, with SAA scores often below 10.0, making them risky for high-stakes domains.
- Two test sets used: FinRAGBench-V subset (~200 samples) and ViDoRe V3 held-out set (~400 samples).
- ViDoRe V3 test set is manually annotated and completely disjoint from training data.
- Evaluation metrics are identical to those used in AgenticOCR.
- Input resolution impact analysis shows a precipitous drop in grounding reliability as resolution decreases.
- Standardized inference infrastructure used 8×NVIDIA H200 GPUs.
- Infrastructure ensures consistent latency and sufficient VRAM for high-resolution document processing.
Section IDs: sec_19, sec_33, sec_35
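The IoU-based verification of bounding-box citations mentioned in the approach can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `(x0, y0, x1, y1)` box format, the function names `iou` and `evidence_recall`, and the 0.5 threshold are all assumptions, since the summary does not state them. Page and document matching are ignored here for brevity.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_recall(pred: Sequence[Box], gold: Sequence[Box], thr: float = 0.5) -> float:
    """Fraction of gold boxes matched by some predicted box at IoU >= thr.

    The threshold value is an assumption, not taken from the paper.
    """
    if not gold:
        return 0.0
    hits = sum(1 for g in gold if any(iou(p, g) >= thr for p in pred))
    return hits / len(gold)
```

For instance, a model that cites one of two gold evidence boxes exactly and misses the other would get a recall of 0.5 under this sketch.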
Claim Verification
The paper provides concrete quantitative evidence for this benchmark composition claim. Paragraph 2 explicitly states 'CiteVQA comprises 1,897 high-quality questions derived from 711 PDFs across seven major domains.' Paragraph 24 further confirms thes
While the paper describes the automated annotation pipeline in detail (paragraphs 12-23), the claim that it 'ensures fine-grained precision and consistency' is asserted rather than demonstrated with quantitative evidence in the main text. Paragraph 2
The Traceability Metrics are fully specified in the paper. Paragraph 4 introduces the concept, and paragraphs 25-30 provide formal definitions including the SAA formula (paragraph 29: SAA = 1(Ans.≥4∧(Rel.≥4∨Rec.≥0.6))). The metrics are clearly defined
Specific quantitative results are provided in the paper. Paragraph 5 states 'The SAA of state-of-the-art models like Gemini-3.1-Pro-Preview caps at 76.0, while leading open-source MLLMs fail to surpass the 25.0 threshold.' Paragraph 34 confirms 'Gemi
The 'Attribution Hallucination' phenomenon is demonstrated through systematic evaluation of 20 MLLMs. Paragraph 33 provides concrete evidence: 'while GPT-5.4 and Gemini-3-Flash achieve high answer scores (87.1 and 84.5), their SAA scores drop signifi
Paragraph 13 provides specific numbers: 'Starting from a corpus of over 100 million raw PDF documents (primarily sourced from Common Crawl), we first preselected approximately 250k candidate documents through stratified sampling.' The design choice i
Paragraph 14 states '711 documents were selected as the source for CiteVQA, achieving a balanced coverage across 7 domains and 30 sub-categories.' Paragraph 24 confirms 'CiteVQA is a diverse benchmark comprising 711 documents across 7 macro-domains.'
The linking strategy is described in paragraph 16 and Appendix B.1, but the paper does not provide quantitative validation of its effectiveness in the main text. The claim that it 'integrates isolated documents into logically connected groups' is sta
Paragraph 17 explicitly states: 'We utilize MinerU2.5 for deep document parsing to obtain fine-grained results containing document IDs, page numbers, bounding box (BBox) coordinates, and OCR content.' This is a clearly specified design choice with too
Paragraph 17 describes this design choice: 'Drawing inspiration from DocDancer and WebSailor, we employ high-performance MLLMs (e.g., Gemini-3.0-Flash-Preview) as intelligent agents. These agents navigate the parsed BBox space to identify and concate
Paragraph 18 describes the template-based QA synthesis: 'During construction, high-performance MLLMs first select the most appropriate logical template based on the characteristics of the Evidence Package, subsequently synthesizing QA pairs automatic
Paragraph 21 explicitly describes this filtering step: 'To ensure the challenging nature of the dataset, we execute a "zero-document self-test" using Qwen3-VL-235B-A22B-Instruct: questions that the model can answer without any document context (class
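The "zero-document self-test" could be sketched as a simple filter. The quoted sentence is cut off before stating what happens to such questions, so this sketch assumes they are discarded. The callables `answer_without_docs` and `is_correct` are hypothetical placeholders for a closed-book model call and an answer judge; neither is an API from the paper.

```python
from typing import Callable, Iterable, List, Tuple

QA = Tuple[str, str]  # (question, reference answer)


def zero_document_filter(
    qa_pairs: Iterable[QA],
    answer_without_docs: Callable[[str], str],  # hypothetical closed-book model call
    is_correct: Callable[[str, str], bool],     # hypothetical answer judge
) -> List[QA]:
    """Keep only questions the model fails to answer with no document context.

    Assumption: questions answerable without documents are dropped.
    """
    kept: List[QA] = []
    for question, reference in qa_pairs:
        guess = answer_without_docs(question)
        if not is_correct(guess, reference):
            kept.append((question, reference))
    return kept
```

With a stub model that always replies "Paris", a question whose reference answer is "Paris" would be filtered out, while a document-specific question would be kept.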
Paragraph 22 describes the ablation-based procedure: 'For the core evidence chain determination, we designed an ablation-based crucial evidence identification procedure: each BBox element in the Evidence Package is masked individually before being pr
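A leave-one-out reading of this ablation procedure might look like the sketch below. The quoted description is truncated, so the decision rule used here (an element is crucial if the answer becomes wrong once it is masked) is an assumption, and `answers_correctly` is a hypothetical stand-in for the model-plus-judge call.

```python
from typing import Callable, List, Sequence


def crucial_evidence(
    elements: Sequence[str],
    answers_correctly: Callable[[Sequence[str]], bool],  # hypothetical model + judge
) -> List[str]:
    """Return elements whose individual masking breaks the answer.

    Assumption: "crucial" means the model no longer answers correctly
    when that single element is removed from the Evidence Package.
    """
    crucial: List[str] = []
    for i, element in enumerate(elements):
        # Mask element i by leaving it out of the context.
        remaining = [e for j, e in enumerate(elements) if j != i]
        if not answers_correctly(remaining):
            crucial.append(element)
    return crucial
```

Each element costs one extra model call in this scheme, so the procedure scales linearly with the size of the Evidence Package.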
Paragraph 24 provides comprehensive statistics: 'CiteVQA is a diverse benchmark comprising 711 documents across 7 macro-domains, with a realistic average length of 40.6 pages. The 1,897 questions cover varied scenarios including single-doc (52.0%), m
Paragraph 24 provides specific numbers: 'Each task requires an average of 2.57 evidence elements, nearly 30% of which are non-textual (tables, images, or equations).' The quantitative claim is directly supported.
Paragraph 29 provides the exact formula: 'A sample-level binary metric requiring both high-quality grounding and answer correctness: SAA = 1 (Ans.≥4∧(Rel.≥4∨Rec.≥0.6)).' The metric is fully and formally specified.
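The SAA indicator quoted above translates directly into code. The sketch below follows the formula as written; the function names, the argument order, and the choice to report the aggregate as a percentage are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def saa_sample(ans: float, rel: float, rec: float) -> int:
    """SAA = 1(Ans. >= 4 and (Rel. >= 4 or Rec. >= 0.6)) for one sample."""
    return int(ans >= 4 and (rel >= 4 or rec >= 0.6))


def saa_overall(samples: Iterable[Tuple[float, float, float]]) -> float:
    """Mean of the per-sample indicator, as a percentage (assumed reporting scale)."""
    samples = list(samples)
    if not samples:
        return 0.0
    return 100.0 * sum(saa_sample(*s) for s in samples) / len(samples)
```

Note that a correct answer (Ans. ≥ 4) with weak grounding on both Rel. and Rec. scores 0, which is how the metric penalizes the "Attribution Hallucination" cases.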
Paragraph 31 describes the experimental setup: 'We evaluated 20 state-of-the-art MLLMs, encompassing both leading proprietary and open-source models, on the CiteVQA benchmark... All models were tested using a unified prompt template with a sampling t
Paragraph 31 states: 'For automated evaluation, we employed Qwen3-VL-235B-A22B as the primary judge.' The design choice is clearly specified with the exact model name.
Paragraph 33 provides specific numbers: 'while GPT-5.4 and Gemini-3-Flash achieve high answer scores (87.1 and 84.5), their SAA scores drop significantly to 59.0 and 65.4, respectively.' The quantitative evidence directly supports the finding about t
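The size of this drop can be worked out from the quoted figures. The snippet below is only arithmetic on the numbers cited above; `attribution_gap` is a name introduced here, and subtracting the two scores assumes they are reported on a comparable 0-100 scale.

```python
# (answer score, SAA) as quoted from paragraph 33
scores = {
    "GPT-5.4": (87.1, 59.0),
    "Gemini-3-Flash": (84.5, 65.4),
}


def attribution_gap(answer_score: float, saa: float) -> float:
    """Points lost when correct grounding is also required (assumes comparable scales)."""
    return round(answer_score - saa, 1)


gaps = {name: attribution_gap(a, s) for name, (a, s) in scores.items()}
```

Under that assumption the gap is 28.1 points for GPT-5.4 and 19.1 points for Gemini-3-Flash.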
Paragraph 34 provides specific numbers: 'Closed-source MLLMs dominate the benchmark, with Gemini-3.1-Pro-Preview leading at an Overall SAA of 76.0. While GPT-5.4 excels in semantic answer correctness (87.1), it is surpassed by Gemini models in SAA.'
... 39 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Code is not publicly available - no implementation details can be verified or reused
- Dataset CiteVQA is not publicly available - benchmark cannot be reproduced or used for comparison
- Prompt template used for evaluation is mentioned but not provided in full
- Random seeds are not specified for reproducibility of results
- Exact list and versions of the 20 evaluated MLLMs with their specific API configurations
- Judge model (Qwen3-VL-235B-A22B) prompting strategy and configuration details
- Batch size used during training
- Optimizer type and configuration (only learning rate is specified)
- Ground truth annotation process and guidelines for the benchmark
- Data preprocessing steps for document screenshots
Limitations (Author-Stated)
- Although the benchmark spans seven major domains, the definition of authoritative evidence may involve domain-specific nuances in highly specialized vertical fields that warrant further exploration.
- Our automated curation pipeline prioritizes data fidelity by leveraging state-of-the-art Multimodal Large Language Models (MLLMs), which, while ensuring high-quality reasoning and attribution, introduces a significant computational resource barrier for large-scale replication.
- The multi-dimensional evaluation protocol, which incorporates coordinate verification and fine-grained textual alignment, requires higher computational overhead than standard VQA tasks, a deliberate choice to prioritize evaluative depth and traceability over raw scoring efficiency.
- A potential negative impact is the risk of models overfitting to the specific metrics and document distributions of CiteVQA. While our benchmark aims to improve document intelligence, excessive optimization for these specific tasks may lead to reduced generalizability when models encounter diverse real-world document structures not represented in our dataset.
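This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.
Analysis time: 2026-05-19T01:08:06+00:00 · Data source: Paper Collector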