OpenSearch-VL presents a fully open recipe for training multimodal deep search agents, with publicly released data, code, and models. Using image-grounded multi-hop data and fatal-aware GRPO, OpenSearch-VL-30B-A3B achieves a 13.8-point improvement across 7 benchmarks, outperforming proprietary models.
Core Question
How can we create a fully open, reproducible recipe for training frontier multimodal deep search agents that effectively interleave reasoning with tool calls over visual and retrieval tools?
Core Method
The authors develop a scalable data curation pipeline that synthesizes 36,592 high-quality multi-hop expert trajectories via Wikipedia path sampling and Claude Opus rollouts, followed by a two-stage training process: supervised fine-tuning to instill reasoning behaviors, then reinforcement learning using a novel multi-turn fatal-aware GRPO algorithm with composite rewards and one-sided advantage clamping to handle cascading tool failures.
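The group-relative advantage and one-sided clamping mentioned above can be illustrated with a minimal sketch. This summary does not give the paper's exact formulation, so the code below is an assumption rather than the authors' algorithm: it uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and reads "one-sided clamping" as bounding only the negative advantage of rollouts flagged as fatal. The names `group_relative_advantages`, `fatal`, and `neg_clip` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, fatal, neg_clip=-1.0, eps=1e-6):
    """Group-normalized advantages with an assumed one-sided clamp.

    rewards:  scalar rewards for the rollouts sampled from one prompt.
    fatal:    booleans marking rollouts that ended in a tool failure
              (hypothetical flag; the paper's criterion is not given here).
    neg_clip: assumed lower bound applied only to fatal rollouts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = []
    for reward, is_fatal in zip(rewards, fatal):
        advantage = (reward - mu) / (sigma + eps)
        if is_fatal:
            # One-sided: only the negative side is bounded, so a fatal
            # rollout is never penalized more strongly than neg_clip.
            advantage = max(advantage, neg_clip)
        advantages.append(advantage)
    return advantages


if __name__ == "__main__":
    adv = group_relative_advantages(
        rewards=[1.0, 0.0, 0.0, 0.0],
        fatal=[False, False, False, True],
        neg_clip=-0.5,
    )
    print(adv)
```

Under this reading, a rollout that failed because of a tool error still receives a negative signal, but a bounded one, which is one way to keep cascading tool failures from dominating the update.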
Claim Verification
The paper describes the OpenSearch-VL method in detail, but the 'fully open' claim is a future commitment. The paper states 'we will release' the data, code, and models (p_1, p_64), but no actual release is provided within the paper itself. The metho
This is a future commitment to release data, code, and models. The paper states 'we will release' (p_1, p_64) but no actual release is provided. Future release promises cannot be verified from the paper alone.
The paper provides detailed descriptions of all three components: (1) the data curation pipeline in Section 3, (2) the tool environment in Section 2 with Table 1, and (3) the fatal-aware GRPO algorithm in Section 4.2. The experimental results demonst
The paper states this specific improvement in p_1 and provides average scores in p_53 (OpenSearch-VL-30B-A3B achieves 61.6 average). However, the exact baseline comparison yielding 13.8 points is not explicitly shown in the visible text - it should b
The paper explicitly states this design choice in p_5-6, describing how the environment E returns multimodal observations O = O_img ∪ O_txt, contrasting with text-only formulations like Search-R1 (Jin et al., 2025).
The paper explicitly describes this design in p_6, defining the active visual context I^l and explaining how historical visual observations are preserved for cross-referencing multi-hop transformations.
The paper explicitly lists all seven tools with their functions in p_11 and references Table 1 for the complete specification.
The paper describes the complete data curation pipeline in Section 3, with Figure 1 referenced. The pipeline produces 36,592 trajectories (p_27) without manual annotation, using automated processes for VQA construction, filtering, and trajectory synt
The paper explicitly describes this unified construction pipeline in p_15, detailing the Wikipedia path sampling, textual QA synthesis, and image-grounded VQA lifting process.
The design choices (functional roles, decoupled visual anchor) are clearly described in p_15 and p_17. However, the effectiveness claim 'suppressing single-shot retrieval shortcuts' is stated but not quantitatively demonstrated. No ablation or analys
This is a straightforward design choice clearly stated in p_16, defining Wikipedia as a directed graph with articles as nodes and hyperlinks as edges.
The paper explicitly defines the functional roles for each node in p_17: anchor (v_0), bridge nodes (v_1 to v_{h-1}), and answer node (v_h).
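The graph formulation and node roles described above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's sampler: the summary only says that paths are sampled over a directed graph of articles and hyperlinks, so the random walk without revisits, the function names `sample_path` and `assign_roles`, and the tiny example graph are all hypothetical.

```python
import random


def sample_path(graph, start, hops, rng):
    """Walk `hops` outgoing edges from `start` without revisiting a node.

    graph: dict mapping an article title to the titles it links to.
    Returns the visited nodes v_0..v_h, or None if the walk hits a dead end.
    """
    path = [start]
    for _ in range(hops):
        candidates = [n for n in graph.get(path[-1], []) if n not in path]
        if not candidates:
            return None
        path.append(rng.choice(candidates))
    return path


def assign_roles(path):
    """Label v_0 as the anchor, v_1..v_{h-1} as bridges, v_h as the answer."""
    return {"anchor": path[0], "bridges": path[1:-1], "answer": path[-1]}


if __name__ == "__main__":
    # Hypothetical link structure, for illustration only.
    toy_graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
    path = sample_path(toy_graph, "A", hops=3, rng=random.Random(0))
    print(assign_roles(path))
```

Returning `None` on a dead end leaves the retry policy to the caller, which keeps the sampler itself free of assumptions about how failed walks are handled.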
The paper describes this process in p_18, explaining how answers are extracted and canonical questions are synthesized using GPT-4o.
The paper describes the fuzzy rewriting process in p_19-20, explaining the progressive entity rewriting while preserving the answer.
The paper describes the visual grounding process in p_21, including image retrieval from Wikimedia Commons, CLIP filtering, and visual referring expression substitution.
The design choice is described, but the effectiveness claim 'substantially reduces single-hop shortcuts' lacks quantitative evidence. The paper states this benefit but does not provide ablation or analysis comparing shortcut rates with vs. without th
The paper explicitly states this consolidation in p_23, listing the three corpora and their purposes.
The paper describes the two-stage difficulty filter in p_23, using Qwen3-VL-32B to discard examples answerable without tools or with single ImageSearch.
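The two-stage filter described above amounts to two successive rejection checks, sketched below. The predicates `answers_without_tools` and `answers_with_single_image_search` are hypothetical stand-ins for the Qwen3-VL-32B checks; how the paper actually judges an example as "answerable" is not stated in this summary.

```python
def difficulty_filter(examples, answers_without_tools, answers_with_single_image_search):
    """Keep only examples that survive both rejection stages.

    Stage 1 drops examples the checker model answers with no tool calls.
    Stage 2 drops examples it answers after a single ImageSearch call.
    Both predicates are caller-supplied (hypothetical) callables.
    """
    kept = []
    for example in examples:
        if answers_without_tools(example):
            continue  # too easy: solvable from the model's own knowledge
        if answers_with_single_image_search(example):
            continue  # too easy: solvable with one retrieval step
        kept.append(example)
    return kept


if __name__ == "__main__":
    # Toy examples with precomputed outcomes, for illustration only.
    data = [
        {"id": 1, "no_tool": True, "one_search": True},
        {"id": 2, "no_tool": False, "one_search": True},
        {"id": 3, "no_tool": False, "one_search": False},
    ]
    hard = difficulty_filter(
        data,
        answers_without_tools=lambda ex: ex["no_tool"],
        answers_with_single_image_search=lambda ex: ex["one_search"],
    )
    print([ex["id"] for ex in hard])
```

Running the cheaper no-tool check first means the search-based check is only spent on examples that already passed stage 1.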
The paper explicitly states this design in p_23, specifying the 10% selection and the degradation-tool pairings.
The design choice is described, but the effectiveness claim 'induces a think-with-image behavior' is not quantitatively demonstrated. No ablation shows that models trained with this subset actually learn to repair visual evidence before retrieval.
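The 10% subset with paired degradations and repair tools can be pictured with the sketch below. The selection fraction comes from the text above, and the tool names appear elsewhere in this report, but the specific degradation names and their pairings are not given in this summary, so the `REPAIR_TOOL` mapping here is purely hypothetical, as is the function `select_degraded_subset`.

```python
import random

# Hypothetical degradation -> repair-tool pairings; the paper's actual
# pairings are not reproduced in this summary.
REPAIR_TOOL = {
    "blur": "Sharpen",
    "downscale": "SuperResolution",
    "perspective_warp": "PerspectiveCorrect",
}


def select_degraded_subset(example_ids, fraction=0.10, seed=0):
    """Pick a fraction of examples and assign each a degradation and its tool.

    Returns a dict: example id -> (degradation name, expected repair tool).
    """
    rng = random.Random(seed)
    ids = list(example_ids)
    k = round(len(ids) * fraction)
    plan = {}
    for example_id in rng.sample(ids, k):
        degradation = rng.choice(sorted(REPAIR_TOOL))
        plan[example_id] = (degradation, REPAIR_TOOL[degradation])
    return plan


if __name__ == "__main__":
    plan = select_degraded_subset(range(100))
    for example_id, (degradation, tool) in sorted(plan.items()):
        print(example_id, degradation, tool)
```

Recording the expected repair tool next to each degradation would also make it possible to measure how often a trained agent calls that tool before retrieval, which is the kind of evidence the note above finds missing.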
... 60 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Training hyperparameters (learning rate, batch size, epochs, optimizer settings) for both SFT and RL stages
- Random seeds for reproducibility
- Hardware specifications (GPU/TPU type, number of devices, memory requirements)
- Training data details (size, composition, creation methodology, data splits)
- RL reward function design and implementation details
- Inference/generation parameters (temperature, top-p, max tokens, etc.)
- Specific tool implementations (Crop, Sharpen, SuperResolution, PerspectiveCorrect, OCR)
- Training duration and computational cost
- LlamaFactory, rLLM, and Vision-DeepResearch configuration files or modifications
- Data preprocessing and formatting procedures for interleaved multimodal trajectories
Limitations (as stated by the authors)
- A non-trivial fraction of training instability traces to the external tool environment E, including search ranking drift, fetch failures, and occasional summarization hallucinations in TextSearch and ImageSearch, which inflates reward variance and motivates future work on on-policy reliability estimation.
- Our composite reward (Eq. 9) relies on proprietary GPT-4o judges, which are costly, version-dependent, and currently score only textual queries while ignoring intermediate visual operations (e.g., Crop).
- Replacing these with open process reward models covering the full visual action space T_v remains a natural next step.
- Exact numerical reproducibility is challenged by the reliance on these externally hosted APIs (e.g., Serper, PaddleX OCR) and the prohibitive cost of reporting multi-seed error bars for large-scale evaluations.
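This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-05-08T01:11:32+00:00 · Data source: Paper Collector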