CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence - AI Paper In-Depth Analysis

TL;DR
CiteVQA is a benchmark for evaluating evidence attribution in document VQA, comprising 1,897 questions from 711 PDFs. Testing 20 MLLMs reveals "Attribution Hallucination": correct answers paired with wrong evidence citations. State-of-the-art models achieve a Strict Attributed Accuracy (SAA) of only 76.0.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

83%

Core Question

Can current Document Visual Question Answering models faithfully attribute their answers to correct visual evidence locations in documents, rather than just producing accurate textual responses?

Core Method

Approach: The authors constructed the CiteVQA benchmark with 1,897 questions from 711 PDFs using an automated annotation pipeline that generates QA pairs with element-level bounding-box citations. They evaluated 20 state-of-the-art MLLMs using Strict Attributed Accuracy (SAA), a metric requiring both correct answers and faithful evidence attribution verified through IoU thresholds and LLM-based relevance scoring.

Key components:
Closed-source MLLMs dominate the benchmark, with Gemini-3.1-Pro-Preview leading at 76.0 Overall SAA.
GPT-5.4 excels in answer correctness (87.1) but is surpassed by Gemini models in SAA.
Open-source models show a significant performance cliff, with the strongest achieving only 22.5 SAA.
Small-scale MLLMs struggle most, with SAA scores often below 10.0, making them risky for high-stakes domains.
Two test sets used: FinRAGBench-V subset (~200 samples) and ViDoRe V3 held-out set (~400 samples).
ViDoRe V3 test set is manually annotated and completely disjoint from training data.
Evaluation metrics are identical to those used in AgenticOCR.
Input resolution impact analysis shows a precipitous drop in grounding reliability as resolution decreases.
Standardized inference infrastructure used 8×NVIDIA H200 GPUs.
Infrastructure ensures consistent latency and sufficient VRAM for high-resolution document processing.
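
The approach mentions that evidence attribution is verified through IoU thresholds. A minimal sketch of how a bounding-box check of that kind could work is shown here. The box layout `(x0, y0, x1, y1)`, the helper name `evidence_recall`, and the default threshold of 0.5 are assumptions for illustration; this summary does not state the paper's actual threshold or matching rule.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_recall(pred: Sequence[Box], gold: Sequence[Box],
                    iou_threshold: float = 0.5) -> float:
    """Fraction of gold boxes matched by at least one predicted box.

    The threshold value is a placeholder, not a number taken from the paper.
    """
    if not gold:
        return 0.0
    hits = sum(any(iou(p, g) >= iou_threshold for p in pred) for g in gold)
    return hits / len(gold)
```

A recall value of this kind would be the natural input to the `Rec.` term of the SAA metric, though the exact computation used by the authors may differ.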

Claim Verification

Verified (95%) To address these limitations, we introduce CiteVQA: A Benchmark for Faithful Evidence Attribution. Designed for long-form, multi-domain, and cross-lingual scenarios, CiteVQA comprises 1,897 high-quality questions derived from 711 PDFs across seven major domains.
The paper provides concrete quantitative evidence for this benchmark composition claim. Paragraph 2 explicitly states 'CiteVQA comprises 1,897 high-quality questions derived from 711 PDFs across seven major domains.' Paragraph 24 further confirms thes

Insufficient evidence (55%) To this end, we developed a highly scalable, automated annotation pipeline. By synergizing advanced document parsing models with powerful MLLMs, this flexible pipeline ensures fine-grained precision and consistency, effectively laying the foundation for large-scale citation data generation while mitigating subjective human biases during the annotation process.
While the paper describes the automated annotation pipeline in detail (paragraphs 12-23), the claim that it 'ensures fine-grained precision and consistency' is asserted rather than demonstrated with quantitative evidence in the main text. Paragraph 2

Verified (92%) For evaluation, we move beyond answer accuracy and introduce a suite of Traceability Metrics. At its core is Strict Attributed Accuracy (SAA), a rigorous audit requiring the model to be correct in both its textual response and its visual evidence attribution.
The Traceability Metrics are fully specified in the paper. Paragraph 4 introduces the concept, and paragraphs 25-30 provide formal definitions including the SAA formula (paragraph 29: SAA = 1(Ans.≥4∧(Rel.≥4∨Rec.≥0.6))). The metrics are clearly defined

Verified (95%) The SAA of state-of-the-art models like Gemini-3.1-Pro-Preview caps at 76.0, while leading open-source MLLMs fail to surpass the 25.0 threshold.
Specific quantitative results are provided in the paper. Paragraph 5 states 'The SAA of state-of-the-art models like Gemini-3.1-Pro-Preview caps at 76.0, while leading open-source MLLMs fail to surpass the 25.0 threshold.' Paragraph 34 confirms 'Gemi

Verified (88%) Discovery of the "Attribution Hallucination" Phenomenon: Through a comprehensive audit of 20 leading MLLMs, we expose a critical vulnerability: models frequently output correct text while grounding it in entirely incorrect visual evidence.
The 'Attribution Hallucination' phenomenon is demonstrated through systematic evaluation of 20 MLLMs. Paragraph 33 provides concrete evidence: 'while GPT-5.4 and Gemini-3-Flash achieve high answer scores (87.1 and 84.5), their SAA scores drop signifi

Verified (85%) Starting from a corpus of over 100 million raw PDF documents (primarily sourced from Common Crawl), we first preselected approximately 250k candidate documents through stratified sampling.
Paragraph 13 provides specific numbers: 'Starting from a corpus of over 100 million raw PDF documents (primarily sourced from Common Crawl), we first preselected approximately 250k candidate documents through stratified sampling.' The design choice i

Verified (92%) Ultimately, 711 documents were selected as the source for CiteVQA, achieving a balanced coverage across 7 domains and 30 sub-categories.
Paragraph 14 states '711 documents were selected as the source for CiteVQA, achieving a balanced coverage across 7 domains and 30 sub-categories.' Paragraph 24 confirms 'CiteVQA is a diverse benchmark comprising 711 documents across 7 macro-domains.'

Insufficient evidence (50%) To overcome single-document limitations, we propose a linking strategy that aggregates cross-document evidence via semantic alignment. The system identifies candidates through vector similarity and utilizes an LLM to align section-level metadata, integrating isolated documents into logically connected groups.
The linking strategy is described in paragraph 16 and Appendix B.1, but the paper does not provide quantitative validation of its effectiveness in the main text. The claim that it 'integrates isolated documents into logically connected groups' is sta
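
The linking strategy quoted in this claim (candidate retrieval by vector similarity, followed by an LLM alignment check) could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `is_aligned` callable stands in for the LLM metadata-alignment step, the `min_similarity` value of 0.8 is a placeholder, and grouping by union-find is a design choice made only for this example.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_documents(embeddings: Dict[str, Sequence[float]],
                   is_aligned: Callable[[str, str], bool],
                   min_similarity: float = 0.8) -> List[List[str]]:
    """Group documents that are both similar and judged aligned.

    `is_aligned` is a hypothetical stand-in for the LLM alignment check;
    `min_similarity` is an assumed threshold.
    """
    parent = {doc: doc for doc in embeddings}

    def find(doc: str) -> str:
        # Union-find root lookup with path halving.
        while parent[doc] != doc:
            parent[doc] = parent[parent[doc]]
            doc = parent[doc]
        return doc

    for a, b in combinations(embeddings, 2):
        similar = cosine(embeddings[a], embeddings[b]) >= min_similarity
        if similar and is_aligned(a, b):
            parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for doc in embeddings:
        groups.setdefault(find(doc), []).append(doc)
    return sorted(sorted(group) for group in groups.values())
```

Running the cheap similarity filter before the alignment callable keeps the number of expensive LLM calls proportional to the number of plausible pairs rather than all pairs.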

Verified (90%) We utilize MinerU2.5 for deep document parsing to obtain fine-grained results containing document IDs, page numbers, bounding box (BBox) coordinates, and OCR content.
Paragraph 17 explicitly states: 'We utilize MinerU2.5 for deep document parsing to obtain fine-grained results containing document IDs, page numbers, bounding box (BBox) coordinates, and OCR content.' This is a clearly specified design choice with too

Verified (88%) Drawing inspiration from DocDancer and WebSailor, we employ high-performance MLLMs (e.g., Gemini-3.0-Flash-Preview) as intelligent agents. These agents navigate the parsed BBox space to identify and concatenate supporting facts scattered across different pages or documents, ultimately aggregating them into a comprehensive Evidence Package.
Paragraph 17 describes this design choice: 'Drawing inspiration from DocDancer and WebSailor, we employ high-performance MLLMs (e.g., Gemini-3.0-Flash-Preview) as intelligent agents. These agents navigate the parsed BBox space to identify and concate

Verified (85%) During construction, high-performance MLLMs first select the most appropriate logical template based on the characteristics of the Evidence Package, subsequently synthesizing QA pairs automatically based on template constraints and core information within the evidence.
Paragraph 18 describes the template-based QA synthesis: 'During construction, high-performance MLLMs first select the most appropriate logical template based on the characteristics of the Evidence Package, subsequently synthesizing QA pairs automatic

Verified (88%) To ensure the challenging nature of the dataset, we execute a "zero-document self-test" using Qwen3-VL-235B-A22B-Instruct: questions that the model can answer without any document context (classified as common-knowledge-based) are discarded.
Paragraph 21 explicitly describes this filtering step: 'To ensure the challenging nature of the dataset, we execute a "zero-document self-test" using Qwen3-VL-235B-A22B-Instruct: questions that the model can answer without any document context (class
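
The "zero-document self-test" described in this claim amounts to a filter over candidate QA pairs. A minimal sketch, assuming each pair is a dict with `question` and `answer` keys and that the model call and the correctness check are supplied as callables (both hypothetical stand-ins, since the paper's prompts and judging rule are not given here):

```python
from typing import Callable, Dict, List

QAPair = Dict[str, str]  # assumed shape: {"question": ..., "answer": ...}


def zero_document_filter(qa_pairs: List[QAPair],
                         answer_without_docs: Callable[[str], str],
                         is_correct: Callable[[str, str], bool]) -> List[QAPair]:
    """Keep only questions the model cannot answer from prior knowledge.

    `answer_without_docs` stands in for querying the model with no document
    context; `is_correct` stands in for the answer-matching rule.
    """
    kept = []
    for qa in qa_pairs:
        guess = answer_without_docs(qa["question"])
        # A correct zero-context answer marks the question as common knowledge.
        if not is_correct(guess, qa["answer"]):
            kept.append(qa)
    return kept
```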

Verified (85%) For the core evidence chain determination, we designed an ablation-based crucial evidence identification procedure: each BBox element in the Evidence Package is masked individually before being presented to a powerful MLLM. If the model fails to derive the correct answer after a mask is applied, that element is labeled as "Crucial Evidence."
Paragraph 22 describes the ablation-based procedure: 'For the core evidence chain determination, we designed an ablation-based crucial evidence identification procedure: each BBox element in the Evidence Package is masked individually before being pr
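
The ablation procedure quoted in this claim is a leave-one-out loop over the Evidence Package. The sketch below models masking as simply removing the element from the list passed to the model, which is an assumption (the paper may mask pixels or replace content instead); `answer_with_evidence` and `is_correct` are hypothetical callables standing in for the MLLM and the answer check.

```python
from typing import Callable, List, Sequence


def find_crucial_evidence(question: str,
                          gold_answer: str,
                          evidence: Sequence[str],
                          answer_with_evidence: Callable[[str, List[str]], str],
                          is_correct: Callable[[str, str], bool]) -> List[str]:
    """Return the elements whose removal makes the model fail."""
    crucial = []
    for i, element in enumerate(evidence):
        # Mask exactly one element; all other evidence stays visible.
        masked = list(evidence[:i]) + list(evidence[i + 1:])
        prediction = answer_with_evidence(question, masked)
        if not is_correct(prediction, gold_answer):
            crucial.append(element)
    return crucial
```

One property worth noting: an element that is redundant with another piece of evidence would not be labeled crucial by a single-element ablation, because the model can still answer from the remaining copy.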

Verified (95%) CiteVQA is a diverse benchmark comprising 711 documents across 7 macro-domains, with a realistic average length of 40.6 pages. The 1,897 questions cover varied scenarios including single-doc (52.0%), multi-doc with one gold document (25.7%), and multi-doc with multiple gold documents (22.3%), spanning reasoning types from Complex Synthesis to Multimodal Parsing.
Paragraph 24 provides comprehensive statistics: 'CiteVQA is a diverse benchmark comprising 711 documents across 7 macro-domains, with a realistic average length of 40.6 pages. The 1,897 questions cover varied scenarios including single-doc (52.0%), m

Verified (92%) Each task requires an average of 2.57 evidence elements, nearly 30% of which are non-textual (tables, images, or equations).
Paragraph 24 provides specific numbers: 'Each task requires an average of 2.57 evidence elements, nearly 30% of which are non-textual (tables, images, or equations).' The quantitative claim is directly supported.

Verified (95%) A sample-level binary metric requiring both high-quality grounding and answer correctness: SAA = 1 (Ans.≥4∧(Rel.≥4∨Rec.≥0.6)).
Paragraph 29 provides the exact formula: 'A sample-level binary metric requiring both high-quality grounding and answer correctness: SAA = 1 (Ans.≥4∧(Rel.≥4∨Rec.≥0.6)).' The metric is fully and formally specified.
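
The SAA formula translates directly into a small indicator function. In this sketch the thresholds (4, 4, 0.6) come from the quoted formula; treating `Ans.` and `Rel.` as numeric judge scores and `Rec.` as a recall in [0, 1] is an assumption, as is reporting the benchmark-level figure as a percentage of passing samples.

```python
from typing import Iterable, Tuple


def saa(ans: float, rel: float, rec: float) -> int:
    """Indicator: 1 if Ans. >= 4 and (Rel. >= 4 or Rec. >= 0.6), else 0."""
    return int(ans >= 4 and (rel >= 4 or rec >= 0.6))


def overall_saa(samples: Iterable[Tuple[float, float, float]]) -> float:
    """Percentage of samples that pass the SAA check (assumed aggregation)."""
    results = [saa(ans, rel, rec) for ans, rel, rec in samples]
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Because the answer term is joined with a logical AND, a sample with a perfect citation but an answer score below 4 still scores 0, while the OR lets either judged relevance or box recall vouch for the grounding.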

Verified (90%) We evaluated 20 state-of-the-art MLLMs, encompassing both leading proprietary and open-source models, on the CiteVQA benchmark. All models were tested using a unified prompt template with a sampling temperature of 1.0.
Paragraph 31 describes the experimental setup: 'We evaluated 20 state-of-the-art MLLMs, encompassing both leading proprietary and open-source models, on the CiteVQA benchmark... All models were tested using a unified prompt template with a sampling t

Verified (90%) For automated evaluation, we employed Qwen3-VL-235B-A22B as the primary judge.
Paragraph 31 states: 'For automated evaluation, we employed Qwen3-VL-235B-A22B as the primary judge.' The design choice is clearly specified with the exact model name.

Verified (92%) While GPT-5.4 and Gemini-3-Flash achieve high answer scores (87.1 and 84.5), their SAA scores drop significantly to 59.0 and 65.4, respectively. This discrepancy confirms an "Attribution Hallucination" effect: while models possess the perceptual capacity to extract information for a correct answer, they lack the ability to precisely link that information to its specific spatial source within the document.
Paragraph 33 provides specific numbers: 'while GPT-5.4 and Gemini-3-Flash achieve high answer scores (87.1 and 84.5), their SAA scores drop significantly to 59.0 and 65.4, respectively.' The quantitative evidence directly supports the finding about t
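
The size of the drop can be made explicit with simple arithmetic on the quoted figures. The sketch below assumes the answer scores and SAA values are reported on a common 0-100 scale, which is how the quoted passage compares them; the name `attribution_gap` is introduced here for illustration only.

```python
# (answer score, SAA) pairs as quoted from the paper.
scores = {
    "GPT-5.4": (87.1, 59.0),
    "Gemini-3-Flash": (84.5, 65.4),
}


def attribution_gap(answer_score: float, saa: float) -> float:
    """Points lost between answer correctness and strictly attributed accuracy."""
    return round(answer_score - saa, 1)


gaps = {model: attribution_gap(*pair) for model, pair in scores.items()}
# gaps == {"GPT-5.4": 28.1, "Gemini-3-Flash": 19.1}
```

On these numbers GPT-5.4 has the higher answer score but the larger gap, which is consistent with the observation that Gemini models overtake it on SAA.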

Verified (92%) Closed-source MLLMs dominate the benchmark, with Gemini-3.1-Pro-Preview leading at an Overall SAA of 76.0. While GPT-5.4 excels in semantic answer correctness (87.1), it is surpassed by Gemini models in SAA, suggesting Gemini may have more robust native citation alignment.
Paragraph 34 provides specific numbers: 'Closed-source MLLMs dominate the benchmark, with Gemini-3.1-Pro-Preview leading at an Overall SAA of 76.0. While GPT-5.4 excels in semantic answer correctness (87.1), it is surpassed by Gemini models in SAA.'

... 39 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Code is not publicly available - no implementation details can be verified or reused
Dataset CiteVQA is not publicly available - benchmark cannot be reproduced or used for comparison
Prompt template used for evaluation is mentioned but not provided in full
Random seeds are not specified for reproducibility of results
Exact list and versions of the 20 evaluated MLLMs with their specific API configurations
Judge model (Qwen3-VL-235B-A22B) prompting strategy and configuration details
Batch size used during training
Optimizer type and configuration (only learning rate is specified)
Ground truth annotation process and guidelines for the benchmark
Data preprocessing steps for document screenshots

Limitations (as stated by the authors)

Although the benchmark spans seven major domains, the definition of authoritative evidence may involve domain-specific nuances in highly specialized vertical fields that warrant further exploration.
Our automated curation pipeline prioritizes data fidelity by leveraging state-of-the-art Multimodal Large Language Models (MLLMs), which, while ensuring high-quality reasoning and attribution, introduces a significant computational resource barrier for large-scale replication.
The multi-dimensional evaluation protocol, which incorporates coordinate verification and fine-grained textual alignment, requires higher computational overhead than standard VQA tasks, a deliberate choice to prioritize evaluative depth and traceability over raw scoring efficiency.
A potential negative impact is the risk of models overfitting to the specific metrics and document distributions of CiteVQA. While our benchmark aims to improve document intelligence, excessive optimization for these specific tasks may lead to reduced generalizability when models encounter diverse real-world document structures not represented in our dataset.

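This analysis was generated automatically by the PDF reading assistant. It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.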

Analysis time: 2026-05-19T01:08:06+00:00 · Data source: Paper Collector