TL;DR
The paper identifies validity issues in VSI-Bench for VLM spatial reasoning: annotation drift and scene-observability mismatch. The authors introduce ReVSI, re-annotating 381 scenes with frame-budgeted protocols and dummy-video diagnostics.
Verified: 27
Insufficient evidence: 17
Unverifiable: 3
Reproducibility: N/A
Confidence: 70%

Core Question

What are the validity issues in existing spatial reasoning benchmarks for VLMs, and how can benchmark design be improved to accurately assess 3D spatial reasoning capabilities?

Core Method

The authors re-annotate objects and geometry across 381 scenes from 5 datasets using professional 3D tools, regenerate QA pairs with bias mitigation and human verification, and construct frame-budgeted evaluation variants (16/32/64/all frames). They introduce dummy-video diagnostics that remove frames containing queried objects to test whether models rely on visual evidence or on memorized priors.
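
The frame-budget and dummy-video ideas described above can be illustrated with a minimal sketch. This is not the authors' code: uniform frame sampling, the `visibility` mapping (object name to the set of frame indices where it is visible), and all function names are assumptions introduced here for illustration.

```python
from typing import Dict, List, Optional, Sequence, Set


def sample_frame_indices(num_frames: int, budget: Optional[int]) -> List[int]:
    """Pick `budget` evenly spaced frame indices; None means keep all frames.

    Uniform spacing is an assumption; the summary does not state the sampler.
    """
    if budget is None or budget >= num_frames:
        return list(range(num_frames))
    if budget == 1:
        return [num_frames // 2]
    # Evenly spaced positions over [0, num_frames - 1], rounded to indices.
    step = (num_frames - 1) / (budget - 1)
    return [round(i * step) for i in range(budget)]


def is_answerable(queried: Sequence[str],
                  visibility: Dict[str, Set[int]],
                  sampled: Sequence[int]) -> bool:
    """True if every queried object is visible in at least one sampled frame."""
    sampled_set = set(sampled)
    return all(visibility.get(obj, set()) & sampled_set for obj in queried)


def dummy_video_indices(num_frames: int,
                        queried: Sequence[str],
                        visibility: Dict[str, Set[int]]) -> List[int]:
    """Frame indices left after dropping every frame showing a queried object."""
    blocked: Set[int] = set()
    for obj in queried:
        blocked |= visibility.get(obj, set())
    return [i for i in range(num_frames) if i not in blocked]


# Toy example with made-up visibility data (hypothetical, not from the paper).
visibility = {"chair": {0, 1, 2}, "lamp": {7}}
sampled = sample_frame_indices(num_frames=10, budget=4)
print(sampled)                                                # [0, 3, 6, 9]
print(is_answerable(["chair"], visibility, sampled))          # True
print(is_answerable(["chair", "lamp"], visibility, sampled))  # False
print(dummy_video_indices(10, ["chair"], visibility))         # [3, 4, 5, 6, 7, 8, 9]
```

In this toy case the lamp is visible only in frame 7, which the 4-frame budget skips, so a question about the lamp would not be answerable from the sampled input. That is the kind of scene-observability mismatch the frame-budgeted variants are meant to rule out.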

Claim Verification

Verified (75%) We improve evaluation validity by introducing ReVSI, a benchmark and protocol that ensures each QA pair is answerable and correct under the model's actual inputs.
The paper provides detailed methodology for ReVSI (p_7-p_8, p_19-p_20, p_31-p_32) including reannotation, human verification, and frame-budgeted QA construction. However, the claim that QA pairs are 'answerable and correct' is primarily demonstrated
Insufficient evidence (50%) We reannotate objects and geometry across 381 scenes from 5 datasets to improve data quality, and regenerate all QA pairs with rigorous bias mitigation and human verification using professional 3D annotation tools.
The paper describes the reannotation methodology (p_19-p_20) and mentions using professional 3D annotation tools with human verification. However, the specific number '381 scenes' is not found in the provided text - Table 1 which should contain scale
Verified (85%) We further enhance evaluation controllability by providing variants across multiple frame budgets (16/32/64/all) and fine-grained object visibility metadata, enabling controlled diagnostic analyses.
The paper clearly describes constructing QA pairs for 16/32/64/all-frame settings (p_31) and computing fine-grained object visibility metadata. The utility of these features for controlled diagnostic analyses is demonstrated in experiments (Tables 5,
Verified (90%) We developed a 3D web interface and reannotated object labels and 3D bounding boxes across ScanNetv2, ScanNet++, ARKitScenes, 3RScan, and MultiScan.
The paper explicitly states in p_19: 'We developed a 3D web interface (Appendix B.1) and reannotated object labels and 3D bounding boxes across ScanNetv2, ScanNet++, ARKitScenes, 3RScan, and MultiScan.' All five datasets are named and the development
Verified (85%) We construct dummy-videos by removing frames containing the queried objects, enabling controlled analysis of how models rely on visual evidence.
The paper describes the dummy-video construction methodology in detail (p_33-p_35) and demonstrates its use for controlled analysis in experiments (Tables 5, 6). The construction process is clearly specified: removing frames containing queried object
Verified (80%) We propose frame-budgeted evaluation protocols and visibility-guided diagnostics that expose substantial differences in models' sensitivity to visual evidence, reliance on scene priors, and hallucination behavior.
The paper describes the protocols (p_31-p_32, p_33-p_35) and demonstrates their utility through experiments showing different hallucination rates between models (p_46-p_48). The frame-budgeted results (Tables 12, 13 referenced in p_88) and dummy-vide
Verified (80%) Proprietary models are under-assessed by VSI-Bench (e.g., on object counting), while fine-tuned models show a high hallucination rate under the dummy-video setting.
The paper provides evidence for both parts: p_39-p_40 discusses proprietary model under-assessment on VSI-Bench, and p_46 states 'all specialized finetuned models fail catastrophically on object counting' in dummy-video settings. However, exact hallu
Insufficient evidence (50%) All evaluated proprietary models obtain stable or even higher performance on our improved benchmark, while open-source models exhibit substantial accuracy drops (up to 40%), especially on object counting, relative distance, and relative direction tasks.
The claim is stated in p_10 with reference to Table 3, but the table is not provided in the text. The specific 'up to 40%' accuracy drop cannot be verified without seeing the actual numerical results. The finding is asserted but the underlying quanti
Insufficient evidence (55%) Models fine-tuned on 3D spatial reasoning data show significantly smaller gains on ReVSI than on VSI-Bench. Moreover, scaling post-training data does not consistently improve performance, with some fine-tuned models performing worse on specific tasks than their base model.
The paper states these findings (p_41-p_45) with reference to Table 4, but the table is not provided. The specific claim about 'smaller gains' and 'some fine-tuned models performing worse' cannot be quantitatively verified. The ~3% marginal gain from
Verified (80%) Several models (e.g., InternVL3.5) still achieve surprisingly high scores on evidence-absent controls, revealing predictions driven by non-visual priors rather than visual evidence.
The paper provides specific examples in p_47-p_48: InternVL3.5 'consistently predicts 2 on black videos' for object counting and 'achieving high task scores even on black videos' for size estimation. While exact scores require Tables 5 and 6, the beh
Insufficient evidence (45%) Reducing the sampling rate substantially degrades evaluation validity, particularly when fewer than 32 frames are used, a setting commonly adopted in VLM fine-tuning and benchmarking.
The claim is stated in p_17 with reference to Figure 3 and Table 7, but neither is provided in the text. The finding about evaluation validity degradation below 32 frames cannot be verified without the underlying data.
Insufficient evidence (40%) Appearance Order questions remain severely impacted by object absence, resulting in persistent inaccuracies even with 64 frames.
The claim is stated in p_17 with reference to Figure 3 and Table 7, which are not provided. Additionally, p_21 notes that Object Appearance Order task is excluded from ReVSI, so this finding applies to VSI-Bench analysis. The quantitative evidence is
Verified (85%) Predicting '2' alone achieves 62% accuracy on VSI-Bench object counting, revealing strong dataset and environment bias.
The specific number '62% accuracy' is directly stated in p_22: 'predicting "2" alone achieves 62% accuracy, revealing strong dataset and environment bias (Figure 5).' This is a concrete quantitative finding about VSI-Bench bias.
Insufficient evidence (50%) Across all numerical tasks, and particularly for object counting, VSI-Bench systematically underestimates the performance of proprietary models, making open-source models appear substantially stronger.
The claim is stated in p_39 with reference to Table 3, but the table is not provided. The finding about systematic underestimation of proprietary models cannot be verified without the numerical comparison data.
Insufficient evidence (50%) Most models achieve higher scores on ReVSI for absolute distance estimation, despite ReVSI being more challenging due to the removal of short-range (<1m) questions and the inclusion of more long-range queries.
The claim is stated in p_40 with reference to Table 3 and Figure 5, neither of which is provided. The finding about higher scores on ReVSI for absolute distance estimation cannot be verified without the numerical data.
Insufficient evidence (45%) Modern models such as Qwen3-VL demonstrate strong long-range distance estimation, with absolute error increasing noticeably only beyond 6m.
The claim is stated in p_40 with reference to Figure 7, which is not provided. The specific threshold '6m' for when absolute error increases noticeably cannot be verified without the underlying visualization data.
Insufficient evidence (50%) On ReVSI, all fine-tuned models show substantially smaller improvements over their base models, and in several cases (e.g., SpaceR), fine-tuning even leads to performance degradation across multiple tasks.
The claim is stated in p_42 with reference to Table 4, which is not provided. The specific example of SpaceR performance degradation cannot be verified without the numerical comparison data.
Verified (85%) Increasing the training size for Spatial-MLLM from 135k to 820k samples results in only marginal gains of ~3%, suggesting that data quality and supervision fidelity, rather than quantity alone, are the primary bottlenecks.
The specific numbers are directly stated in p_45: 'increasing the training size for Spatial-MLLM from 135k to 820k samples results in only marginal gains of ~3%.' This is a concrete quantitative finding.
Insufficient evidence (50%) Human performance exhibits zero hallucination across all dummy-video settings.
The claim is stated in p_46 with reference to Table 5, which is not provided. While 'zero hallucination' is a strong claim, the underlying data showing human performance on dummy videos is not available for verification.
Insufficient evidence (50%) Proprietary models consistently have lower hallucination rates than open-source models, indicating a stronger tendency to ground predictions in visual evidence rather than priors.
The claim is stated in p_46 with reference to Table 5, which is not provided. The comparison of hallucination rates between proprietary and open-source models cannot be verified without the numerical data.

... 47 claims in total
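
Two of the findings listed above concern constant predictions: guessing "2" scores well on VSI-Bench object counting, and InternVL3.5 is reported to predict 2 consistently on black videos. The sketch below shows how such effects could be measured. It assumes exact-match accuracy, which may differ from the metric used in the paper, and both the helper names and the toy data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def constant_baseline_accuracy(answers: Sequence[Hashable],
                               guess: Hashable) -> float:
    """Exact-match accuracy of always answering `guess` (metric is an assumption)."""
    return sum(a == guess for a in answers) / len(answers)


def modal_share(predictions: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Most frequent prediction and the fraction of predictions equal to it.

    A share near 1.0 on evidence-absent inputs suggests the model is answering
    from a prior rather than from the video.
    """
    value, count = Counter(predictions).most_common(1)[0]
    return value, count / len(predictions)


# Toy data only; these numbers are illustrative, not results from the paper.
toy_ground_truth = [2, 3, 2, 4]
print(constant_baseline_accuracy(toy_ground_truth, 2))  # 0.5

toy_black_video_predictions = [2, 2, 2, 2]
print(modal_share(toy_black_video_predictions))         # (2, 1.0)
```

A high constant-baseline score points to label imbalance in the benchmark, while a high modal share on black or dummy videos points to prior-driven behavior in the model. The two checks are complementary.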

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Limitations (as stated by the authors)

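This analysis was generated automatically by PDF Reading Assistant and is provided for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.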

Analysis time: 2026-04-29T07:32:44+00:00 · Data source: Paper Collector