OpenSeeker-v2 shows that simple supervised fine-tuning (SFT) can achieve state-of-the-art search-agent performance when trained on high-quality, high-difficulty data. With three data-synthesis modifications and 10k trajectories, it outperforms complex industrial pipelines on four benchmarks, demonstrating data quality's critic…
Core Question
Can straightforward supervised fine-tuning achieve state-of-the-art performance for search agents when using high-quality, high-difficulty training data, without requiring complex multi-stage training pipelines?
Core Method
Approach: The authors modify the data synthesis pipeline with three key changes: scaling graph size from k to K for richer multi-hop reasoning paths, expanding the tool set for diverse interaction patterns, and applying strict low-step filtering to discard simple trajectories. The model is trained using standard SFT on Qwen3-30B-A3B-Thinking-2507 without RL or additional hyperparameter tuning.
Key components:
- The central hypothesis is that difficult and information-rich training data makes simple SFT sufficient for strong search and reasoning abilities.
- Graph size scaling (K > k) increases contextual richness and multi-hop dependency in generated queries.
- Tool set expansion encourages agents to learn diverse interaction patterns and leverage complementary tools.
- Strict low-step filtering removes simple instances solvable by direct lookup or shallow keyword matching.
- The standard SFT objective is applied over the filtered dataset without RL or additional hyperparameter tuning.
Section IDs: sec_2, sec_3
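The graph-size scaling and low-step filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper gives no values for k, K, or T_min, so `K_BUDGET` and `T_MIN` are placeholder numbers, and the breadth-first expansion, the `Trajectory` record, and the `>=` comparison are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

# Placeholder values: the paper reports neither K nor T_min.
K_BUDGET = 6   # hypothetical expansion budget K (number of nodes to collect)
T_MIN = 3      # hypothetical minimum number of tool calls to keep a trajectory


def expand_subgraph(graph, seed, budget):
    """Collect up to `budget` nodes reachable from `seed`, breadth first.

    `graph` maps a node to a list of neighbours. Breadth-first order is an
    assumption; the paper only says the graph size grows from k to K.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            visited.append(neighbour)
            queue.append(neighbour)
            if len(visited) >= budget:
                break
    return visited


@dataclass
class Trajectory:
    """Hypothetical record of one synthesized agent trajectory."""
    question: str
    tool_calls: list = field(default_factory=list)


def keep_hard(trajectories, t_min):
    """Low-step filtering: drop trajectories with fewer than `t_min` tool calls."""
    return [t for t in trajectories if len(t.tool_calls) >= t_min]


if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
    print(expand_subgraph(toy_graph, "a", K_BUDGET))

    pool = [
        Trajectory("direct lookup", ["search"]),
        Trajectory("multi-hop", ["search", "visit", "search", "visit"]),
    ]
    print([t.question for t in keep_hard(pool, T_MIN)])
```

A larger budget lets a single question draw on more linked nodes, and the filter then discards any trajectory the agent resolved in only a few tool calls. Whether the paper's threshold is inclusive or exclusive is not stated.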
Claim Verification
The paper provides quantitative benchmark results (46.0%, 58.1%, 34.6%, 78.0%) and comparisons against Tongyi DeepResearch showing that the SFT-only approach achieves competitive performance. However, the causal claim that 'data quality' is the key factor is…
The method of scaling graph size is described, but no specific values for k and K are provided, and no ablation study demonstrates that this actually produces 'more complex tasks' or 'multi-hop exploration'. The claimed effect is asserted but not val…
The paper mentions expanding the tool set but provides no specific numbers (how many tools before/after), and no ablation study demonstrates that this leads to 'more versatile strategies'. The claimed benefit is stated but not empirically validated.
The low-step filtering method is described, but the threshold T_min is not specified, and no analysis demonstrates that filtered trajectories are indeed 'simple' or that this filtering improves agent performance. No ablation comparing filtered vs unf…
The paper explicitly states the dataset size as 'merely 10k' in p_6 and '10.6k' in p_26. This is a straightforward factual claim with specific numbers provided.
This is a claim about open-sourcing model weights. While the paper states the intention, verifying whether the weights are actually released requires checking external resources beyond the paper itself.
The SOTA claim among ~30B ReAct agents is supported by benchmark data. However, claims about being 'first' and 'purely academic team' are historical/organizational assertions not substantiated with evidence in the paper. No citation or evidence prove…
Specific benchmark scores are provided (46.0%, 58.1%, 34.6%, 78.0%) and comparisons in p_21-p_22 show these are the highest among reported ~30B ReAct-based baselines. The SOTA claim within this scope is well-supported.
Specific numerical comparisons are provided: Tongyi DeepResearch achieves 43.4%, 46.7%, 32.9%, 75.0% vs OpenSeeker-v2's 46.0%, 58.1%, 34.6%, 78.0%. The outperformance is clearly demonstrated with concrete data.
The claim references 'these two' models (Tongyi DeepResearch and RedSearcher) but RedSearcher's specific benchmark numbers are not provided in the text. For Tongyi DeepResearch, the numbers can be partially verified (BrowseComp: 2.6%✓, HLE: 1.7%≥0.3%…
The paper claims OpenSeeker-v2 outperforms DeepSeek-V3.1-671B, GLM-4.6-357B, Minimax-M2-230B, and Claude-4.5-Sonnet, but no specific benchmark scores for these models are provided in the text. The comparison is stated but not substantiated with data.
The paper states OpenSeeker-v2 'substantially improves upon OpenSeeker-v1' but provides no quantitative comparison, benchmark scores for v1, or improvement percentages. The claim is made without supporting data.
While benchmark results show SFT works well, the causal interpretation that 'data quality is a critical path' is not rigorously validated. No ablation studies isolate data quality vs quantity vs other factors. The conclusion goes beyond what the evid…
This is a straightforward factual statement about model architecture. The paper explicitly states the base model (Qwen3-30B-A3B-Thinking-2507) with 30B total and 3B activated parameters.
Straightforward factual statement about model configuration. The paper explicitly states '256k context window and allows up to 200 tool calls per trajectory'.
Straightforward methodological statement. The paper explicitly states 'OpenSeeker-v2 is trained with SFT, without RL or additional hyperparameter tuning'.
The claim states 'five challenging agentic benchmarks' but lists only four: BrowseComp, BrowseComp-ZH, Humanity's Last Exam (HLE), and xbench-DeepSearch. This is an internal inconsistency: the stated count does not match the list provided.
Straightforward methodological statement about preventing data leakage. The paper states 'We mask the hugging-face-related links when calling the web search tools to avoid potential leakage'.
The paper references Table 1 and describes the baseline comparison approach with focus on comparable-scale ReAct-based search agents in p_20.
The design choice is described (increasing from k to K where K > k) but no specific values are provided. Without concrete numbers, the magnitude of this change cannot be assessed.
... 32 claims in total
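The margins behind the Tongyi DeepResearch comparison quoted in the claims above can be recomputed from the reported scores. The snippet below is a small arithmetic check, assuming the four scores appear in the order BrowseComp, BrowseComp-ZH, HLE, xbench-DeepSearch. That ordering is inferred from the order in which this analysis names the benchmarks and from the 2.6 and 1.7 point gaps it cites for BrowseComp and HLE; it is not stated alongside the scores themselves.

```python
# Benchmark order is an assumption inferred from the analysis text.
BENCHMARKS = ["BrowseComp", "BrowseComp-ZH", "HLE", "xbench-DeepSearch"]

# Scores (in %) as quoted in the claim-verification notes.
OPENSEEKER_V2 = [46.0, 58.1, 34.6, 78.0]
TONGYI_DEEPRESEARCH = [43.4, 46.7, 32.9, 75.0]


def margins(ours, baseline):
    """Return per-benchmark differences in percentage points, rounded to 0.1."""
    return [round(a - b, 1) for a, b in zip(ours, baseline)]


if __name__ == "__main__":
    for name, delta in zip(BENCHMARKS, margins(OPENSEEKER_V2, TONGYI_DEEPRESEARCH)):
        print(f"{name}: +{delta:.1f} points")
```

Under that ordering the gaps are 2.6, 11.4, 1.7, and 3.0 percentage points, so the largest reported improvement would be on BrowseComp-ZH.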
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Exact hyperparameters for SFT training (learning rate, batch size, number of epochs, optimizer settings, weight decay, warmup steps)
- Exact values for key parameters: expansion budget K, original budget k, and minimum tool-call threshold T_min
- Complete specification of the expanded tool set A (what tools are included, their implementations)
- Training data details: source graph G specification, dataset size, data splits, how seed nodes are selected
- Random seeds for reproducibility
- Hardware specifications (GPU type, number of GPUs, memory requirements, training duration)
- Software environment details (framework, PyTorch version, dependencies)
- Exact prompts and templates used for query generation and agent interaction
- Data preprocessing and filtering implementation details
- Evaluation protocol details (exact prompts, how answers are extracted and scored)
Limitations (as stated by the authors)
- Though trained with only 10.6k samples, OpenSeeker-v2 achieves a new SOTA across four representative benchmarks
- Moving forward, we will continue to push in this direction by scaling up data quantity, quality, and diversity, with the goal of further pushing the limits of search agents.
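This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may contain inaccuracies. For the original paper, see arXiv.
Analysis time: 2026-05-07T13:07:51+00:00 · Data source: Paper Collector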