OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories - AI Paper In-Depth Analysis

TL;DR
OpenSeeker-v2 shows simple supervised fine-tuning achieves state-of-the-art search agent performance using high-quality, high-difficulty data. With three synthesis modifications and 10k trajectories, it outperforms complex industrial pipelines on four benchmarks, demonstrating data quality's critic…

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

66%

Core question

Can straightforward supervised fine-tuning achieve state-of-the-art performance for search agents when using high-quality, high-difficulty training data, without requiring complex multi-stage training pipelines?

Core method

Approach: The authors modify the data synthesis pipeline with three key changes: scaling graph size from k to K for richer multi-hop reasoning paths, expanding the tool set for diverse interaction patterns, and applying strict low-step filtering to discard simple trajectories. The model is trained using standard SFT on Qwen3-30B-A3B-Thinking-2507 without RL or additional hyperparameter tuning.

Key components:
The central hypothesis is that difficult and information-rich training data makes simple SFT sufficient for strong search and reasoning abilities.
Graph size scaling (K > k) increases contextual richness and multi-hop dependency in generated queries.
Tool set expansion encourages agents to learn diverse interaction patterns and leverage complementary tools.
Strict low-step filtering removes simple instances solvable by direct lookup or shallow keyword matching.
The standard SFT objective is applied over the filtered dataset without RL or additional hyperparameter tuning.

Source sections: sec_2, sec_3

Claim verification

Verified (75%) In this report, we introduce OpenSeeker-v2, an upgraded search agent that proves a straightforward SFT approach could be sufficiently powerful when fueled by high-quality data of high difficulty and richness.
The paper provides quantitative benchmark results (46.0%, 58.1%, 34.6%, 78.0%) and comparisons against Tongyi DeepResearch showing SFT-only approach achieves competitive performance. However, the causal claim that 'data quality' is the key factor is

Insufficient evidence (50%) Scaling graph size for richer exploration: We significantly expand the topological graph size during data generation. This expansion injects a much richer and more diverse set of source information into the context, enabling the synthesis of highly complex tasks that structurally mandate deep, multi-hop exploration to solve.
The method of scaling graph size is described, but no specific values for k and K are provided, and no ablation study demonstrates that this actually produces 'more complex tasks' or 'multi-hop exploration'. The claimed effect is asserted but not val

Insufficient evidence (50%) Expanding the tool set for broader functionality: We increase the number of available tools, allowing the agent to learn more versatile strategies and handle a wider variety of queries.
The paper mentions expanding the tool set but provides no specific numbers (how many tools before/after), and no ablation study demonstrates that this leads to 'more versatile strategies'. The claimed benefit is stated but not empirically validated.

Insufficient evidence (50%) Strict low-step filtering: We filter out any trajectory that can be resolved in too few tool-call steps. By intentionally dropping these simple queries, we guarantee a strict minimum difficulty floor for the training set, forcing the agent to learn sustained reasoning and information seeking over long horizons.
The low-step filtering method is described, but the threshold T_min is not specified, and no analysis demonstrates that filtered trajectories are indeed 'simple' or that this filtering improves agent performance. No ablation comparing filtered vs unf
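
The low-step filtering described in this claim can be illustrated with a minimal sketch. The `Trajectory` container, the `filter_low_step` helper, and the threshold value are all hypothetical: the paper does not report T_min, and whether the cutoff is inclusive is an assumption made here.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical container for one synthesized agent trajectory."""
    question: str
    tool_calls: list = field(default_factory=list)  # one entry per tool-call step


def filter_low_step(trajectories, t_min):
    """Keep only trajectories that used at least `t_min` tool calls.

    Assumption: "too few steps" means strictly fewer than `t_min`.
    """
    return [t for t in trajectories if len(t.tool_calls) >= t_min]


T_MIN = 3  # placeholder value; the paper does not specify the threshold

samples = [
    Trajectory("direct lookup", ["search"]),
    Trajectory("multi-hop", ["search", "visit", "search", "visit"]),
]
kept = filter_low_step(samples, T_MIN)
print([t.question for t in kept])  # only the four-step trajectory survives
```

Counting tool calls is only a proxy for difficulty, which is why the analysis above asks for evidence that the discarded trajectories are in fact simple.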

Verified (95%) By applying these two strategies, we curate a highly condensed dataset of merely 10k high-difficulty trajectories.
The paper explicitly states the dataset size as 'merely 10k' in p_6 and '10.6k' in p_26. This is a straightforward factual claim with specific numbers provided.

Unverifiable (50%) To democratize frontier search agent research and provide an easily reproducible baseline for the community, we are excited to fully open-source the OpenSeeker-v2 model weights.
This is a claim about open-sourcing model weights. While the paper states the intention, verifying whether the weights are actually released requires checking external resources beyond the paper itself.

Insufficient evidence (50%) OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm (ReAct) to be developed entirely by a purely academic team using only SFT.
The SOTA claim among ~30B ReAct agents is supported by benchmark data. However, claims about being 'first' and 'purely academic team' are historical/organizational assertions not substantiated with evidence in the paper. No citation or evidence prove

Verified (90%) OpenSeeker-v2 achieves a new SOTA across four representative agentic benchmarks: 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench.
Specific benchmark scores are provided (46.0%, 58.1%, 34.6%, 78.0%) and comparisons in p_21-p_22 show these are the highest among reported ~30B ReAct-based baselines. The SOTA claim within this scope is well-supported.

Verified (90%) This simple SFT baseline decisively outperforms prominent industrial models such as Tongyi DeepResearch, which relies on an extensive CPT+SFT+RL pipeline and achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively.
Specific numerical comparisons are provided: Tongyi DeepResearch achieves 43.4%, 46.7%, 32.9%, 75.0% vs OpenSeeker-v2's 46.0%, 58.1%, 34.6%, 78.0%. The outperformance is clearly demonstrated with concrete data.

Insufficient evidence (50%) On the challenging benchmarks BrowseComp and HLE, OpenSeeker-v2 outperforms these two by at least 2.6% and 0.3%, respectively; while on the BrowseComp-ZH and xbench, OpenSeeker-v2 significantly outperforms Tongyi DeepResearch by 11.4% and 3%, respectively.
The claim references 'these two' models (Tongyi DeepResearch and RedSearcher) but RedSearcher's specific benchmark numbers are not provided in the text. For Tongyi DeepResearch, the numbers can be partially verified (BrowseComp: 2.6%✓, HLE: 1.7%≥0.3%
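
The margins over Tongyi DeepResearch can be recomputed from the scores quoted earlier in this analysis. The sketch below uses only those eight numbers; the variable names are illustrative, and RedSearcher is omitted because its scores are not given in the text.

```python
BENCHMARKS = ["BrowseComp", "BrowseComp-ZH", "HLE", "xbench"]

# Scores (%) as quoted in this analysis.
openseeker_v2 = [46.0, 58.1, 34.6, 78.0]
tongyi_deepresearch = [43.4, 46.7, 32.9, 75.0]

# Absolute margin in percentage points, rounded to one decimal place
# to hide floating-point noise.
margins = {
    name: round(ours - theirs, 1)
    for name, ours, theirs in zip(BENCHMARKS, openseeker_v2, tongyi_deepresearch)
}

for name, delta in margins.items():
    print(f"{name}: +{delta} points")
```

This yields margins of 2.6, 11.4, 1.7, and 3.0 points. The BrowseComp, BrowseComp-ZH, and xbench figures match the claim, and the 1.7-point HLE margin is consistent with the "at least 0.3%" wording only if the smaller gap refers to the other baseline.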

Insufficient evidence (40%) OpenSeeker-v2 also outperforms DeepSeek-V3.1-671B, GLM-4.6-357B, Minimax-M2-230B, Claude-4.5-Sonnet, indicating its strong capability.
The paper claims OpenSeeker-v2 outperforms DeepSeek-V3.1-671B, GLM-4.6-357B, Minimax-M2-230B, and Claude-4.5-Sonnet, but no specific benchmark scores for these models are provided in the text. The comparison is stated but not substantiated with data.

Insufficient evidence (30%) OpenSeeker-v2 substantially improves upon OpenSeeker-v1 (Du et al., 2026)
The paper states OpenSeeker-v2 'substantially improves upon OpenSeeker-v1' but provides no quantitative comparison, benchmark scores for v1, or improvement percentages. The claim is made without supporting data.

Insufficient evidence (50%) These results demonstrate that a straightforward SFT approach can be sufficiently powerful when fueled by high-quality data of high difficulty and richness, suggesting that data quality could be a critical path towards training intelligent long-horizon search agents.
While benchmark results show SFT works well, the causal interpretation that 'data quality is a critical path' is not rigorously validated. No ablation studies isolate data quality vs quantity vs other factors. The conclusion goes beyond what the evid

Verified (95%) We instantiate OpenSeeker-v2 from Qwen3-30B-A3B-Thinking-2507 (Team, 2025), which has 30B total parameters and 3B activated parameters during inference.
This is a straightforward factual statement about model architecture. The paper explicitly states the base model (Qwen3-30B-A3B-Thinking-2507) with 30B total and 3B activated parameters.

Verified (95%) The agent uses a 256k context window and allows up to 200 tool calls per trajectory.
Straightforward factual statement about model configuration. The paper explicitly states '256k context window and allows up to 200 tool calls per trajectory'.

Verified (90%) OpenSeeker-v2 is trained with SFT, without RL or additional hyperparameter tuning.
Straightforward methodological statement. The paper explicitly states 'OpenSeeker-v2 is trained with SFT, without RL or additional hyperparameter tuning'.
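
For reference, a standard SFT objective over agent trajectories can be written as follows. The notation is an assumption introduced here, not taken from the paper: D is the filtered trajectory dataset, q a query, τ a tokenized trajectory, and m_t a 0/1 mask that selects the tokens contributing to the loss. Masking out tool observations is a common choice, but the paper's exact masking scheme is not stated in this analysis.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(q,\tau)\sim\mathcal{D}}
    \left[ \sum_{t=1}^{|\tau|} m_t \,
      \log \pi_\theta\!\left(\tau_t \mid q, \tau_{<t}\right) \right]
```

With m_t = 1 for every token this reduces to the plain next-token negative log-likelihood.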

Unverifiable (95%) We evaluate OpenSeeker-v2 on five challenging agentic benchmarks: BrowseComp (Wei et al., 2025), BrowseComp-ZH (Zhou et al., 2025), Humanity's Last Exam (HLE) (Phan et al., 2025), and xbench-DeepSearch (Xbench-Team, 2025).
The claim states 'five challenging agentic benchmarks' but only lists four: BrowseComp, BrowseComp-ZH, Humanity's Last Exam (HLE), and xbench-DeepSearch. This is an internal inconsistency - the number doesn't match the list provided.

Verified (90%) We mask the hugging-face-related links when calling the web search tools to avoid potential leakage.
Straightforward methodological statement about preventing data leakage. The paper states 'We mask the hugging-face-related links when calling the web search tools to avoid potential leakage'.
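
One way such link masking could work is to drop search results whose host belongs to a blocked domain. This is a sketch under stated assumptions: the blocked-domain list, the result format (dicts with a `url` key), and both helper names are hypothetical, and the paper does not describe its actual implementation.

```python
from urllib.parse import urlparse

# Assumed list of Hugging Face domains; not taken from the paper.
BLOCKED_DOMAINS = ("huggingface.co", "hf.co")


def is_blocked(url):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def mask_results(results):
    """Filter a list of hypothetical search-result dicts by their 'url' field."""
    return [r for r in results if not is_blocked(r["url"])]


results = [
    {"title": "Benchmark dataset card", "url": "https://huggingface.co/datasets/example"},
    {"title": "Unrelated article", "url": "https://example.org/article"},
]
print([r["title"] for r in mask_results(results)])
```

Matching on the parsed hostname rather than on a raw substring avoids accidentally dropping unrelated pages that merely mention the domain in their path or query string.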

Verified (85%) We compare OpenSeeker-v2 with representative systems in Table 1, with a primary focus on comparable-scale ReAct-based search agents.
The paper references Table 1 and describes the baseline comparison approach with focus on comparable-scale ReAct-based search agents in p_20.

Insufficient evidence (50%) In OpenSeeker-v2, we increase the expansion budget from k to K, where K > k, and obtain a larger evidence subgraph.
The design choice is described (increasing from k to K where K > k) but no specific values are provided. Without concrete numbers, the magnitude of this change cannot be assessed.
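
To make the effect of the expansion budget concrete, here is a toy sketch that grows a subgraph from a seed node until a node budget is reached. Breadth-first expansion, the `expand_subgraph` helper, the toy graph, and the budget values 3 and 6 are all assumptions for illustration; the paper reports neither its expansion strategy in this level of detail nor the values of k and K.

```python
from collections import deque


def expand_subgraph(graph, seed, budget):
    """Breadth-first expansion from `seed`, stopping once `budget` nodes are collected."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) >= budget:
                    break
    return visited


# Toy adjacency list standing in for the source graph.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": ["F"],
    "E": [],
    "F": [],
}

small = expand_subgraph(toy_graph, "A", budget=3)  # plays the role of k
large = expand_subgraph(toy_graph, "A", budget=6)  # plays the role of K > k
print(small)
print(large)
```

With the larger budget the subgraph reaches nodes several hops from the seed, which is the property the authors rely on to synthesize multi-hop questions.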

... 32 claims in total

Reproducibility assessment

Relatively low reproducibility (0%)

Missing reproduction details

Exact hyperparameters for SFT training (learning rate, batch size, number of epochs, optimizer settings, weight decay, warmup steps)
Exact values for key parameters: expansion budget K, original budget k, and minimum tool-call threshold T_min
Complete specification of the expanded tool set A (what tools are included, their implementations)
Training data details: source graph G specification, dataset size, data splits, how seed nodes are selected
Random seeds for reproducibility
Hardware specifications (GPU type, number of GPUs, memory requirements, training duration)
Software environment details (framework, PyTorch version, dependencies)
Exact prompts and templates used for query generation and agent interaction
Data preprocessing and filtering implementation details
Evaluation protocol details (exact prompts, how answers are extracted and scored)

Limitations (as stated by the authors)

Though trained with only 10.6k samples, OpenSeeker-v2 achieves a new SOTA across four representative benchmarks
Moving forward, we will continue to push in this direction by scaling up data quantity, quality, and diversity, with the goal of further pushing the limits of search agents.

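This analysis was generated automatically by the PDF reading assistant and is for reference only; it does not constitute an academic review opinion. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper's text and may be biased. For the original paper, see arXiv.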

Analysis time: 2026-05-07T13:07:51+00:00 · Data source: Paper Collector