OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents - AI Paper In-Depth Analysis

TL;DR
OpenSearch-VL presents a fully open recipe for training multimodal deep search agents, with data, code, and models that the authors commit to releasing publicly. Using image-grounded multi-hop data and fatal-aware GRPO, OpenSearch-VL-30B-A3B achieves an average improvement of 13.8 points across 7 benchmarks, outperforming proprietary models.

Verdict categories: Verified, Insufficient evidence, Unverifiable

Reproducibility: N/A

Confidence: 86%

Core Question

How can we create a fully open, reproducible recipe for training frontier multimodal deep search agents that effectively interleave reasoning with tool calls over visual and retrieval tools?

Core Method

The authors develop a scalable data curation pipeline that synthesizes 36,592 high-quality multi-hop expert trajectories via Wikipedia path sampling and Claude Opus rollouts, followed by a two-stage training process: supervised fine-tuning to instill reasoning behaviors, then reinforcement learning using a novel multi-turn fatal-aware GRPO algorithm with composite rewards and one-sided advantage clamping to handle cascading tool failures.
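
The RL stage named above can be illustrated with a minimal sketch of group-relative advantages plus a one-sided clamp. This analysis does not reproduce the paper's exact rule, so everything below is an assumption: the function name, the `fatal_flags` input, and in particular the clamp direction (a rollout that ended in a fatal tool failure is never given a negative advantage, so the policy is not punished for an environment fault). The composite reward is reduced to one scalar per rollout.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, fatal_flags, eps=1e-6):
    """Normalise rewards within one rollout group, then clamp fatal rollouts.

    ASSUMPTION: the clamp direction below (no negative advantage for a
    rollout flagged as a fatal tool failure) is an illustrative guess,
    not the rule taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = []
    for reward, fatal in zip(rewards, fatal_flags):
        advantage = (reward - mu) / (sigma + eps)
        if fatal:
            # One-sided: only the negative side is clipped.
            advantage = max(advantage, 0.0)
        advantages.append(advantage)
    return advantages


# Toy group of four rollouts; the second one hit a fatal tool failure.
advantages = group_relative_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0],
    fatal_flags=[False, True, False, False],
)
```

In this toy group the fatal rollout ends with an advantage of zero, while the ordinary failed rollout keeps its negative advantage.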

Claim Verification

Insufficient evidence (60%) We introduce OpenSearch-VL, a fully open recipe for training frontier multimodal deep search agents.
The paper describes the OpenSearch-VL method in detail, but the 'fully open' claim is a future commitment. The paper states 'we will release' the data, code, and models (p_1, p_64), but no actual release is provided within the paper itself. The metho

Unverifiable (95%) We will release the training data, code, and models to provide an open foundation for reproducible research on multimodal agentic search.
This is a future commitment to release data, code, and models. The paper states 'we will release' (p_1, p_64) but no actual release is provided. Future release promises cannot be verified from the paper alone.

Verified (85%) We build the key components required for training advanced multimodal search agents, including high-quality image-grounded multi-hop training data, a diverse tool environment, and a multi-turn fatal-aware GRPO algorithm.
The paper provides detailed descriptions of all three components: (1) the data curation pipeline in Section 3, (2) the tool environment in Section 2 with Table 1, and (3) the fatal-aware GRPO algorithm in Section 4.2. The experimental results demonst

Verified (75%) our trained OpenSearch-VL-30B-A3B brings an average improvement of 13.8 points across 7 multimodal deep search benchmarks.
The paper states this specific improvement in p_1 and provides average scores in p_53 (OpenSearch-VL-30B-A3B achieves 61.6 average). However, the exact baseline comparison yielding 13.8 points is not explicitly shown in the visible text - it should b

Verified (90%) Unlike text-only formulations (Jin et al., 2025), our environment E returns multimodal observations.
The paper explicitly states this design choice in p_5-6, describing how the environment E returns multimodal observations O = O_img ∪ O_txt, contrasting with text-only formulations like Search-R1 (Jin et al., 2025).

Verified (90%) The active visual context grows monotonically as I^l = {I_0} ∪ {o_k : k < l, o_k ∈ O_img}; historical visual observations are strictly preserved so that the policy can cross-reference multi-hop visual transformations (e.g. a localised Crop against its SuperResolution-enhanced counterpart).
The paper explicitly describes this design in p_6, defining the active visual context I^l and explaining how historical visual observations are preserved for cross-referencing multi-hop transformations.
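
The set definition of the active visual context can be written out as a short sketch. The `Observation` type, its field names, and the 0-based step indexing are assumptions made for illustration; only the rule "initial image plus every earlier image observation" comes from the claim above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    kind: str     # "img" or "txt" (hypothetical encoding of O_img / O_txt)
    payload: str  # hypothetical: an image identifier or a text snippet


def active_visual_context(initial_image, observations, step):
    """Return I^l: the initial image plus every image observation o_k with k < step."""
    context = [initial_image]
    for k, obs in enumerate(observations):
        if k < step and obs.kind == "img":
            context.append(obs.payload)
    return context


# Toy trajectory: a crop, a text search result, then a super-resolved crop.
trajectory = [
    Observation("img", "crop.png"),
    Observation("txt", "search snippet"),
    Observation("img", "crop_sr.png"),
]
context_at_3 = active_visual_context("input.png", trajectory, step=3)
```

Because nothing is ever removed, the context at step l is always a prefix of the context at step l + 1, which is what lets a crop be compared with its enhanced counterpart later in the trajectory.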

Verified (90%) OpenSearch-VL is equipped with a suite of tools covering three complementary functions: retrieval (TextSearch, ImageSearch) for gathering external evidence, image enhancement (Sharpen, SuperResolution, PerspectiveCorrect) for remedying low-quality inputs, and attention and parsing (Crop, OCR) for localizing and decoding fine-grained content.
The paper explicitly lists all seven tools with their functions in p_11 and references Table 1 for the complete specification.
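
The seven tools and their three functional groups can be captured in a small lookup table. The tool and group names come from the claim above; the dictionary layout and the `tool_group` helper are hypothetical and say nothing about how the actual tool environment is implemented.

```python
# Tool names as listed in the claim; the grouping keys are illustrative labels.
TOOL_GROUPS = {
    "retrieval": ("TextSearch", "ImageSearch"),
    "image_enhancement": ("Sharpen", "SuperResolution", "PerspectiveCorrect"),
    "attention_and_parsing": ("Crop", "OCR"),
}


def tool_group(tool_name):
    """Return the functional group a tool belongs to, or raise KeyError."""
    for group, tools in TOOL_GROUPS.items():
        if tool_name in tools:
            return group
    raise KeyError(f"unknown tool: {tool_name}")


total_tools = sum(len(tools) for tools in TOOL_GROUPS.values())
```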

Verified (85%) we design a scalable data curation pipeline (Figure 1) that synthesizes high-quality trajectories without manual human annotation.
The paper describes the complete data curation pipeline in Section 3, with Figure 1 referenced. The pipeline produces 36,592 trajectories (p_27) without manual annotation, using automated processes for VQA construction, filtering, and trajectory synt

Verified (90%) we adopt a unified construction pipeline: we sample multi-hop trajectories over the Wikipedia hyperlink graph, synthesize textual QA pairs along each trajectory, and lift them into image-grounded VQA via answer-preserving fuzzy rewriting and source-anchored visual grounding.
The paper explicitly describes this unified construction pipeline in p_15, detailing the Wikipedia path sampling, textual QA synthesis, and image-grounded VQA lifting process.

Insufficient evidence (60%) Compared with prior QA constructions (Geng et al., 2025; Li et al., 2025b; Wu et al., 2025a), our pipeline (i) assigns each node on the sampled path an explicit functional role within the reasoning chain, and (ii) deliberately decouples the visual anchor from the answer entity, thereby suppressing single-shot retrieval shortcuts.
The design choices (functional roles, decoupled visual anchor) are clearly described in p_15 and p_17. However, the effectiveness claim 'suppressing single-shot retrieval shortcuts' is stated but not quantitatively demonstrated. No ablation or analys

Verified (95%) We cast the Wikipedia (wik) as a directed graph G = (V, E) with articles as nodes and in-article hyperlinks as edges.
This is a straightforward design choice clearly stated in p_16, defining Wikipedia as a directed graph with articles as nodes and hyperlinks as edges.

Verified (95%) Each node on P is assigned a functional role: v_0 is the anchor (visual entry point, to be replaced by a visual referring expression), v_1, ..., v_{h-1} are bridge nodes (intermediate entities with fuzzified names), and v_h is the answer node (source of the target attribute).
The paper explicitly defines the functional roles for each node in p_17: anchor (v_0), bridge nodes (v_1 to v_{h-1}), and answer node (v_h).
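
The graph view and role assignment can be sketched on a toy adjacency list. The uniform random walk without revisits is an assumption (the sampling strategy is not described in this analysis), and the node names and the `sample_path` and `assign_roles` helpers are hypothetical.

```python
import random


def sample_path(graph, start, hops, rng):
    """Random walk of `hops` edges from `start`, never revisiting a node.

    ASSUMPTION: uniform neighbour choice; returns None if the walk gets stuck.
    """
    path = [start]
    while len(path) - 1 < hops:
        candidates = [v for v in graph.get(path[-1], []) if v not in path]
        if not candidates:
            return None
        path.append(rng.choice(candidates))
    return path


def assign_roles(path):
    """Label v_0 as anchor, v_1..v_{h-1} as bridge nodes, and v_h as answer."""
    last = len(path) - 1
    roles = []
    for i, node in enumerate(path):
        if i == 0:
            roles.append((node, "anchor"))
        elif i == last:
            roles.append((node, "answer"))
        else:
            roles.append((node, "bridge"))
    return roles


# Toy directed graph: article -> articles it links to (hypothetical names).
toy_graph = {"A": ["B"], "B": ["C"], "C": []}
path = sample_path(toy_graph, "A", hops=2, rng=random.Random(0))
roles = assign_roles(path)
```

On this chain-shaped toy graph the only possible 2-hop path is A, B, C, so A becomes the anchor, B the single bridge node, and C the answer node.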

Verified (90%) We extract a short, unambiguous answer a from v_h and prompt GPT-4o (Team, 2024) to synthesize a canonical question q_t that verbalizes P and references v_h only through the queried attribute
The paper describes this process in p_18, explaining how answers are extracted and canonical questions are synthesized using GPT-4o.

Verified (90%) we progressively rewrite q_t into a fuzzy counterpart q_f while fixing a.
The paper describes the fuzzy rewriting process in p_19-20, explaining the progressive entity rewriting while preserving the answer.

Verified (90%) We retrieve a representative image I of the anchor v_0 from Wikimedia Commons or its Wikipedia infobox, filter candidates by CLIP similarity to a short textual description of v_0, and replace v_0 in q_f with a visual referring expression (e.g., "the person in the image") to yield the final question q.
The paper describes the visual grounding process in p_21, including image retrieval from Wikimedia Commons, CLIP filtering, and visual referring expression substitution.
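
The similarity-based filtering step can be sketched as a cosine-similarity ranking. To stay self-contained, the sketch assumes the CLIP text and image embeddings were computed beforehand and works on plain lists of floats; the threshold value, the function names, and the toy 2-D vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(text_emb, image_embs, threshold):
    """Keep candidate images whose similarity to the description meets the
    threshold, best match first. `threshold` is a hypothetical knob."""
    scored = [(cosine(text_emb, emb), name) for name, emb in image_embs.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored if score >= threshold]


# Toy 2-D "embeddings" standing in for precomputed CLIP features.
description = [1.0, 0.0]
candidates = {"a.jpg": [1.0, 0.0], "b.jpg": [0.0, 1.0], "c.jpg": [1.0, 1.0]}
kept = filter_candidates(description, candidates, threshold=0.5)
```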

Insufficient evidence (50%) anchoring v_0 at the source of P substantially reduces single-hop shortcuts: the agent must first identify the visual anchor and then follow the intermediate textual relations before reaching a.
The design choice is described, but the effectiveness claim 'substantially reduces single-hop shortcuts' lacks quantitative evidence. The paper states this benefit but does not provide ablation or analysis comparing shortcut rates with vs. without th

Verified (90%) we consolidate the Wikipedia-derived VQA instances from Sec. 3.1 with three open-source multimodal corpora (LiveVQA (Fu et al., 2025), FVQA (Wang et al., 2017), and WebQA (Chang et al., 2022)) to broaden coverage across live entities, commonsense fact lookup, and open-web multi-hop reasoning.
The paper explicitly states this consolidation in p_23, listing the three corpora and their purposes.

Verified (90%) we apply a two-stage difficulty filter using a frozen Qwen3-VL-32B (Bai et al., 2025): first discarding examples answerable without tools, and then discarding examples solvable with a single ImageSearch call.
The paper describes the two-stage difficulty filter in p_23, using Qwen3-VL-32B to discard examples answerable without tools or with single ImageSearch.
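
The two-stage filter reduces to two successive rejection tests. In the sketch below the frozen model is replaced by two caller-supplied predicates, since how the model is queried is not described here; the predicate names and the toy `difficulty` field are hypothetical.

```python
def difficulty_filter(examples, solves_without_tools, solves_with_one_image_search):
    """Keep only examples that survive both rejection stages.

    Both predicates are stand-ins for calls to the frozen model:
    stage 1 rejects examples answerable with no tools at all,
    stage 2 rejects examples solvable with a single ImageSearch call.
    """
    kept = []
    for example in examples:
        if solves_without_tools(example):
            continue  # stage 1: too easy, no tool needed
        if solves_with_one_image_search(example):
            continue  # stage 2: a single retrieval call is enough
        kept.append(example)
    return kept


# Toy pool where "difficulty" encodes how much tool use an example needs.
pool = [{"id": 1, "difficulty": 0}, {"id": 2, "difficulty": 1}, {"id": 3, "difficulty": 2}]
hard_examples = difficulty_filter(
    pool,
    solves_without_tools=lambda ex: ex["difficulty"] == 0,
    solves_with_one_image_search=lambda ex: ex["difficulty"] <= 1,
)
```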

Verified (90%) we randomly select 10% of the filtered VQA pool and apply controlled degradations (blur, downsampling, and perspective distortion) paired with the corresponding enhancement tools in T_v (Sharpen, SuperResolution, and PerspectiveCorrect).
The paper explicitly states this design in p_23, specifying the 10% selection and the degradation-tool pairings.
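
The 10% selection and the degradation-to-tool pairing can be sketched as follows. The pairing is read off the order in which the degradations and tools are listed in the claim; choosing one degradation per example uniformly at random, the seed, and the record layout are assumptions.

```python
import random

# Pairing inferred from the listing order in the claim above.
DEGRADATION_TO_TOOL = {
    "blur": "Sharpen",
    "downsampling": "SuperResolution",
    "perspective_distortion": "PerspectiveCorrect",
}


def build_enhancement_subset(pool, fraction=0.10, seed=0):
    """Pick `fraction` of the pool and tag each pick with one degradation
    and the enhancement tool expected to repair it.

    ASSUMPTION: one uniformly chosen degradation per selected example.
    """
    rng = random.Random(seed)
    chosen = rng.sample(pool, round(len(pool) * fraction))
    kinds = list(DEGRADATION_TO_TOOL)
    subset = []
    for example in chosen:
        degradation = rng.choice(kinds)
        subset.append(
            {"example": example, "degradation": degradation,
             "tool": DEGRADATION_TO_TOOL[degradation]}
        )
    return subset


subset = build_enhancement_subset(list(range(100)))
```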

Insufficient evidence (50%) This enhancement subset diversifies the training distribution and induces a think-with-image behavior: when the input image is unreliable, the policy learns to repair the visual evidence before initiating retrieval.
The design choice is described, but the effectiveness claim 'induces a think-with-image behavior' is not quantitatively demonstrated. No ablation shows that models trained with this subset actually learn to repair visual evidence before retrieval.

... 60 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Training hyperparameters (learning rate, batch size, epochs, optimizer settings) for both SFT and RL stages
Random seeds for reproducibility
Hardware specifications (GPU/TPU type, number of devices, memory requirements)
Training data details (size, composition, creation methodology, data splits)
RL reward function design and implementation details
Inference/generation parameters (temperature, top-p, max tokens, etc.)
Specific tool implementations (Crop, Sharpen, SuperResolution, PerspectiveCorrect, OCR)
Training duration and computational cost
LlamaFactory, rLLM, and Vision-DeepResearch configuration files or modifications
Data preprocessing and formatting procedures for interleaved multimodal trajectories

Limitations (as stated by the authors)

A non-trivial fraction of training instability traces to the external tool environment E (including search ranking drift, fetch failures, and occasional summarization hallucinations in TextSearch and ImageSearch), which inflates reward variance and motivates future work on on-policy reliability estimation.
Our composite reward (Eq. 9) relies on proprietary GPT-4o judges, which are costly, version-dependent, and currently score only textual queries while ignoring intermediate visual operations (e.g., Crop); replacing these with open process reward models covering the full visual action space T_v remains a natural next step.
Exact numerical reproducibility is challenged by the reliance on these externally hosted APIs (e.g., Serper, PaddleX OCR) and the prohibitive cost of reporting multi-seed error bars for large-scale evaluations.

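This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification verdicts and the reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.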

Analysis time: 2026-05-08T01:11:32+00:00 · Data source: Paper Collector