The paper identifies validity issues in VSI-Bench for VLM spatial reasoning: annotation drift and scene-observability mismatch. The authors introduce ReVSI, re-annotating 381 scenes with frame-budgeted protocols and dummy-video diagnostics.
Core Question
What are the validity issues in existing spatial reasoning benchmarks for VLMs, and how can benchmark design be improved to accurately assess 3D spatial reasoning capabilities?
Core Method
The authors re-annotate objects and geometry across 381 scenes from 5 datasets using professional 3D tools, regenerate QA pairs with bias mitigation and human verification, and construct frame-budgeted evaluation variants (16/32/64/all frames). They introduce dummy-video diagnostics that remove frames containing queried objects to test whether models rely on visual evidence versus memorized priors.
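The two protocols described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the sampling strategy is assumed to be uniform (the analyzed text does not specify it), and the function names and the per-frame visibility mapping are hypothetical.

```python
from typing import Dict, List, Set


def uniform_frame_indices(num_frames: int, budget: int) -> List[int]:
    """Pick `budget` evenly spaced frame indices (ASSUMED uniform sampling).

    If the budget is at least the video length, every frame is kept,
    which corresponds to the "all frames" setting.
    """
    if num_frames <= 0 or budget <= 0:
        raise ValueError("num_frames and budget must be positive")
    if budget >= num_frames:
        return list(range(num_frames))
    # Take the centre of each of `budget` equal-width bins over the video.
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def dummy_video_indices(
    frame_indices: List[int],
    visible_objects: Dict[int, Set[str]],
    queried: Set[str],
) -> List[int]:
    """Drop every frame in which any queried object is visible.

    `visible_objects` maps a frame index to the set of object labels
    visible in that frame (hypothetical metadata format).
    """
    return [
        i for i in frame_indices
        if not (visible_objects.get(i, set()) & queried)
    ]


if __name__ == "__main__":
    visibility = {0: {"chair"}, 1: {"table"}, 2: set(), 3: {"chair", "sofa"}}
    print(uniform_frame_indices(8, 4))
    print(dummy_video_indices([0, 1, 2, 3], visibility, {"chair"}))
```

A model that still reports the queried object on the filtered frame list cannot be drawing on visual evidence, which is the contrast the dummy-video diagnostic is meant to expose.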
Claim Verification
The paper provides detailed methodology for ReVSI (p_7-p_8, p_19-p_20, p_31-p_32) including reannotation, human verification, and frame-budgeted QA construction. However, the claim that QA pairs are 'answerable and correct' is primarily demonstrated
The paper describes the reannotation methodology (p_19-p_20) and mentions using professional 3D annotation tools with human verification. However, the specific number '381 scenes' is not found in the provided text - Table 1 which should contain scale
The paper clearly describes constructing QA pairs for 16/32/64/all-frame settings (p_31) and computing fine-grained object visibility metadata. The utility of these features for controlled diagnostic analyses is demonstrated in experiments (Tables 5,
The paper explicitly states in p_19: 'We developed a 3D web interface (Appendix B.1) and reannotated object labels and 3D bounding boxes across ScanNetv2, ScanNet++, ARKitScenes, 3RScan, and MultiScan.' All five datasets are named and the development
The paper describes the dummy-video construction methodology in detail (p_33-p_35) and demonstrates its use for controlled analysis in experiments (Tables 5, 6). The construction process is clearly specified: removing frames containing queried object
The paper describes the protocols (p_31-p_32, p_33-p_35) and demonstrates their utility through experiments showing different hallucination rates between models (p_46-p_48). The frame-budgeted results (Tables 12, 13 referenced in p_88) and dummy-vide
The paper provides evidence for both parts: p_39-p_40 discusses proprietary model under-assessment on VSI-Bench, and p_46 states 'all specialized finetuned models fail catastrophically on object counting' in dummy-video settings. However, exact hallu
The claim is stated in p_10 with reference to Table 3, but the table is not provided in the text. The specific 'up to 40%' accuracy drop cannot be verified without seeing the actual numerical results. The finding is asserted but the underlying quanti
The paper states these findings (p_41-p_45) with reference to Table 4, but the table is not provided. The specific claim about 'smaller gains' and 'some fine-tuned models performing worse' cannot be quantitatively verified. The ~3% marginal gain from
The paper provides specific examples in p_47-p_48: InternVL3.5 'consistently predicts 2 on black videos' for object counting and 'achieving high task scores even on black videos' for size estimation. While exact scores require Tables 5 and 6, the beh
The claim is stated in p_17 with reference to Figure 3 and Table 7, but neither is provided in the text. The finding about evaluation validity degradation below 32 frames cannot be verified without the underlying data.
The claim is stated in p_17 with reference to Figure 3 and Table 7, which are not provided. Additionally, p_21 notes that Object Appearance Order task is excluded from ReVSI, so this finding applies to VSI-Bench analysis. The quantitative evidence is
The specific number '62% accuracy' is directly stated in p_22: 'predicting "2" alone achieves 62% accuracy, revealing strong dataset and environment bias (Figure 5).' This is a concrete quantitative finding about VSI-Bench bias.
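A constant-answer baseline of this kind is simple to compute. The sketch below assumes exact-match scoring over integer counts; the function names and the toy labels are illustrative and are not taken from VSI-Bench.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def constant_answer_accuracy(gold: Sequence[Hashable], constant: Hashable) -> float:
    """Accuracy of always answering `constant`, under exact-match scoring."""
    if not gold:
        raise ValueError("gold must be non-empty")
    return sum(1 for g in gold if g == constant) / len(gold)


def best_constant_answer(gold: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Most frequent gold answer and the accuracy of always predicting it."""
    if not gold:
        raise ValueError("gold must be non-empty")
    answer, hits = Counter(gold).most_common(1)[0]
    return answer, hits / len(gold)


if __name__ == "__main__":
    toy_gold = [2, 2, 3, 2, 4]  # toy labels, NOT benchmark data
    print(constant_answer_accuracy(toy_gold, 2))
    print(best_constant_answer(toy_gold))
```

Reporting this number next to model scores shows how much of a task can be solved from the answer distribution alone.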
The claim is stated in p_39 with reference to Table 3, but the table is not provided. The finding about systematic underestimation of proprietary models cannot be verified without the numerical comparison data.
The claim is stated in p_40 with reference to Table 3 and Figure 5, neither of which is provided. The finding about higher scores on ReVSI for absolute distance estimation cannot be verified without the numerical data.
The claim is stated in p_40 with reference to Figure 7, which is not provided. The specific threshold '6m' for when absolute error increases noticeably cannot be verified without the underlying visualization data.
The claim is stated in p_42 with reference to Table 4, which is not provided. The specific example of SpaceR performance degradation cannot be verified without the numerical comparison data.
The specific numbers are directly stated in p_45: 'increasing the training size for Spatial-MLLM from 135k to 820k samples results in only marginal gains of ~3%.' This is a concrete quantitative finding.
The claim is stated in p_46 with reference to Table 5, which is not provided. While 'zero hallucination' is a strong claim, the underlying data showing human performance on dummy videos is not available for verification.
The claim is stated in p_46 with reference to Table 5, which is not provided. The comparison of hallucination rates between proprietary and open-source models cannot be verified without the numerical data.
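For orientation, one plausible way to compute a hallucination rate on dummy videos is sketched below. The paper's exact definition is not available in the analyzed text, so the rule used here is an assumption: for an object-counting question whose target has been removed from the video, any positive count is treated as a hallucination, and `None` stands for an abstention.

```python
from typing import Optional, Sequence


def hallucination_rate(predicted_counts: Sequence[Optional[int]]) -> float:
    """Fraction of dummy-video responses that report a positive count.

    ASSUMED definition: the queried object is absent from every frame, so
    a positive count is unsupported by visual evidence. `None` marks an
    abstention ("cannot determine") and is not counted as a hallucination.
    """
    if not predicted_counts:
        raise ValueError("predicted_counts must be non-empty")
    hallucinated = sum(1 for p in predicted_counts if p is not None and p > 0)
    return hallucinated / len(predicted_counts)


if __name__ == "__main__":
    # Toy responses, NOT results from the paper.
    print(hallucination_rate([2, 0, None, 2]))
```

Under this definition, a model that always answers 2 scores a rate of 1.0, while a respondent that never reports an unseen object scores 0.0.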
... 47 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Dataset details: No information about ReVSI dataset construction, size, sample distribution, or question types
- Data availability: No link or access to the ReVSI benchmark dataset
- Code availability: No code repository provided for evaluation pipeline or data processing
- Model inference parameters: No details on temperature, top-p, max tokens, or other generation parameters for VLMs
- API specifications: No API versions, access dates, or specific model version identifiers for GPT-5.2 and Gemini 3
- Prompt templates: No prompt engineering details or instruction formats used for evaluation
- Frame sampling methodology: No details on how frames are sampled from videos (uniform, keyframe detection, etc.)
- Hardware/environment specifications: No information on computational resources used
- Random seeds: No random seed specifications for reproducibility
- Preprocessing steps: No details on input image/video preprocessing or normalization
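The items above could be captured in a single run record published alongside the results. The sketch below is one possible shape for such a record. Every field name and every value is a hypothetical placeholder, and none of them comes from the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRunRecord:
    """Settings needed to repeat one evaluation run (hypothetical schema)."""
    model_id: str                 # exact model or API version identifier
    access_date: Optional[str]    # date of API access; None for local models
    temperature: float
    top_p: float
    max_new_tokens: int
    frame_budget: Optional[int]   # None means "all frames"
    frame_sampling: str           # e.g. "uniform"
    prompt_template: str
    seed: int
    preprocessing: str            # resizing / normalization description
    hardware: str


def to_json(record: EvalRunRecord) -> str:
    """Serialize the record deterministically so runs can be diffed."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = EvalRunRecord(
        model_id="example-vlm-v0",          # placeholder value
        access_date=None,
        temperature=0.0,
        top_p=1.0,
        max_new_tokens=64,
        frame_budget=32,
        frame_sampling="uniform",
        prompt_template="{question}\nAnswer briefly.",
        seed=0,
        preprocessing="resize shorter side to 448 px",
        hardware="1x GPU (unspecified)",
    )
    print(to_json(record))
```

Because the dataclass is frozen and the JSON keys are sorted, two runs with identical settings produce byte-identical records.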
Limitations (author-stated)
- The high-quality 3D indoor spatial intelligence dataset introduced in this work relies on costly expert-level human annotation, which limits scalability to substantially larger datasets or training-scale supervision.
- Developing automated or semi-automated pipelines for generating high-quality spatial supervision remains an important direction for future work.
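This analysis was generated automatically by PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-29T07:32:44+00:00 · Data source: Paper Collector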