TL;DR
This paper introduces Direct Corpus Interaction (DCI), where agents search raw corpora using terminal tools instead of semantic retrievers. DCI achieves 80.0% accuracy on BrowseComp-Plus (+11.0 over baseline with 29.4% cost reduction), 83.0% on QA (+30.7 over baseline), and 68.
Verified: 32
Insufficient evidence: 3
Unverifiable: 0
Reproducibility: N/A
Confidence: 89%

Core Question

Can retrieval for agentic search be improved by replacing conventional retriever-mediated interfaces with direct corpus interaction via terminal tools, and how does the retrieval interface itself shape agent capabilities?

Core Method

The authors implement two DCI agents (DCI-Agent-Lite and DCI-Agent-CC) that search raw corpora with terminal tools such as grep, file reads, and shell commands instead of embedding-based retrievers. They evaluate across three benchmark families: BrowseComp-Plus for agentic search, six knowledge-intensive QA datasets, and IR ranking benchmarks (BRIGHT and BEIR), comparing against retrieval agents and sparse/dense retrievers.
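To make the idea of searching a raw corpus concrete, the following is a minimal sketch of a recursive lexical search in plain Python. It is not the authors' implementation: the function name `grep_corpus`, the `max_hits` cap, and the UTF-8 assumption are all illustrative choices, and a real DCI agent would issue shell commands such as grep or rg directly.

```python
import re
from pathlib import Path


def grep_corpus(root, pattern, max_hits=20):
    """Return (path, line_number, line) tuples for lines matching `pattern`.

    Roughly what `grep -rn -i PATTERN ROOT` would surface, capped at
    `max_hits` (a hypothetical limit to keep the output small).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    # Sort paths so the result order is deterministic.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((str(path), number, line.rstrip("\n")))
                    if len(hits) >= max_hits:
                        return hits
    return hits
```

Each hit identifies a single line inside a file rather than a whole document or passage, which is the kind of fine-grained unit the report's "retrieval interface resolution" claim refers to.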

Claim Verification

Verified (95%) We position direct corpus interaction (DCI) as a new retrieval interface for agentic search. Instead of querying a conventional semantic retriever or retrieval API, the agent searches the raw corpus directly using general-purpose terminal tools such as grep, simple file reads, shell commands, and lightweight scripts.
The paper clearly defines and formalizes DCI in paragraphs 4, 13-15. It provides concrete details about how DCI works: the agent bypasses embedding models and vector indexes, instead using terminal tools (grep, rg, find, glob, file reads) to search t
Verified (95%) We formalize direct corpus interaction (DCI) as a retrieval paradigm and systematically evaluate it across diverse agentic search settings.
The paper explicitly states this as a key contribution in paragraph 9. DCI is formalized in paragraphs 13-15, and systematically evaluated across three benchmark families: BrowseComp-Plus (agentic search), six knowledge-intensive QA datasets, and six
Verified (95%) We introduce DCI-Agent-Lite, a lightweight terminal coding agent adapted from Pi and restricted to raw terminal interaction. The agent accesses the corpus through bash and file reads, using general-purpose shell operations such as grep and rg for lexical matching, find and glob for file discovery, together with lightweight scripts.
DCI-Agent-Lite is described in detail in paragraph 17. The paper specifies it is adapted from Pi, restricted to raw terminal interaction, uses bash and file reads, with grep/rg for lexical matching, find/glob for file discovery, and lightweight scrip
Verified (95%) We implement DCI using Claude Code as an off-the-shelf CLI agent. We name this variant as DCI-Agent-CC. Compared with DCI-Agent-Lite, it provides stronger prompting, more robust tool orchestration, and built-in context handling.
DCI-Agent-CC is described in paragraph 18. The paper specifies it uses Claude Code as an off-the-shelf CLI agent, provides stronger prompting, more robust tool orchestration, and built-in context handling compared to DCI-Agent-Lite. It still operates
Verified (95%) We equip DCI-Agent-Lite with a lightweight runtime context-management layer built around three mechanisms: Truncation caps the text from each tool call before reinserting it into the live working context; Compaction is an in-memory, zero-LLM operation that clears the contents of older tool-result turns once accumulated tool output exceeds a configured threshold; Summarization replaces compacted history with a model-generated summary while keeping the most recent context intact.
The context-management layer is described in detail in paragraphs 18-21. All three mechanisms are fully specified: Truncation (paragraph 19), Compaction (paragraph 20), and Summarization (paragraph 21), with specific thresholds and behaviors for each.
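The three mechanisms can be sketched as small functions over a list of conversation turns. This is an illustrative reading of the description above, not the paper's code: the turn format (dicts with "role" and "content"), the character-based thresholds, the `keep_recent` parameter, the placeholder strings, and the caller-supplied `summarizer` function are all assumptions.

```python
def truncate(text, max_chars=2000):
    """Truncation: cap one tool result before it enters the working context."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"


def compact(turns, threshold_chars=8000, keep_recent=2):
    """Compaction: clear older tool results once total tool output is too large.

    No model call is made. Contents of all but the `keep_recent` newest
    tool-result turns are replaced by a short placeholder.
    """
    tool_indices = [i for i, t in enumerate(turns) if t["role"] == "tool"]
    total = sum(len(turns[i]["content"]) for i in tool_indices)
    if total <= threshold_chars:
        return turns
    cutoff = max(len(tool_indices) - keep_recent, 0)
    to_clear = set(tool_indices[:cutoff])
    return [
        {**t, "content": "[cleared]"} if i in to_clear else t
        for i, t in enumerate(turns)
    ]


def summarize(turns, summarizer, keep_recent=2):
    """Summarization: replace older history with one summary turn.

    `summarizer` is a hypothetical callable (for example a model call) that
    maps a list of turns to a summary string.
    """
    split = max(len(turns) - keep_recent, 0)
    if split == 0:
        return turns
    old, recent = turns[:split], turns[split:]
    return [{"role": "summary", "content": summarizer(old)}] + recent
```

In this sketch, truncation applies per tool call, compaction runs without any model call when a size threshold is crossed, and summarization is the only step that needs a model.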
Verified (90%) We introduce retrieval interface resolution as a conceptual lens to explain DCI's effectiveness, where 'resolution' denotes the ability to operate on units smaller and more precise than entire documents or passages.
The concept of 'retrieval interface resolution' is introduced in paragraph 6 and listed as a contribution in paragraph 9. The paper defines it as 'the ability to operate on units smaller and more precise than entire documents or passages.' This conce
Verified (95%) We introduce two trajectory-level metrics: Coverage measures whether a trajectory surfaces the relevant (gold) documents at all, reflecting broad evidence access. Localization measures how efficiently the trajectory narrows to a small, usable evidence span within each surfaced gold document, reflecting within-document evidence isolation.
The two trajectory-level metrics are formally defined in paragraphs 24-31. Coverage is defined in paragraphs 25-27 with three aggregate measures (coverage_any, coverage_mean, coverage_all). Localization is formally defined in paragraphs 28-32 with ma
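As an illustration of the Coverage metric, the sketch below computes three per-query values and averages them over queries. The names coverage_any, coverage_mean, and coverage_all come from the analysis above, but their exact definitions are not reproduced here, so the formulas are assumptions: "any" means at least one gold document was surfaced, "mean" is the fraction of gold documents surfaced, and "all" means every gold document was surfaced. Localization is omitted because its definition is not available in this text.

```python
def coverage(gold_docs, surfaced_docs):
    """Per-query coverage under the assumed definitions described above."""
    gold = set(gold_docs)
    if not gold:
        raise ValueError("each query needs at least one gold document")
    found = gold & set(surfaced_docs)
    return {
        "coverage_any": float(bool(found)),
        "coverage_mean": len(found) / len(gold),
        "coverage_all": float(found == gold),
    }


def aggregate(per_query):
    """Average each coverage value over a non-empty list of per-query results."""
    keys = ("coverage_any", "coverage_mean", "coverage_all")
    return {k: sum(q[k] for q in per_query) / len(per_query) for k in keys}
```

For example, a trajectory that surfaces one of two gold documents scores 1.0, 0.5, and 0.0 on the three values respectively.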
Verified (90%) On BrowseComp-Plus, replacing the Qwen3-Embedding-8B retrieval tool with DCI under the same Claude Sonnet 4.6 backbone improves accuracy from 69.0% to 80.0% (+11.0 points) while reducing cost from $1,440 to $1,016 (-29.4%).
Specific quantitative results are provided in paragraphs 5 and 42: accuracy improves from 69.0% to 80.0% (+11.0 points) and cost reduces from $1,440 to $1,016 (-29.4%). These are concrete, verifiable numbers from controlled experiments with the same
Verified (90%) On multi-hop QA, combining DCI with Claude Code as the command-line interface agent achieves 83.0 average accuracy, surpassing the strongest retrieval-agent baseline (Gao et al., 2025) by 30.7 points.
Specific quantitative results are provided in paragraphs 5 and 43: DCI-Agent-CC achieves 83.0% average accuracy on multi-hop QA, surpassing ASearcher-Local-14B (52.3%) by 30.7 points. The paper provides a table (Table 2) with results across six datas
Verified (90%) On IR ranking, the same setup reaches 68.5 average NDCG@10, outperforming the best retrieval baseline (Liu et al., 2025) by 21.5 points.
Specific quantitative results are provided in paragraphs 5 and 44: DCI-Agent-CC achieves 68.5 average NDCG@10 on IR ranking, outperforming ReasonRank-32B (47.0%) by 21.5 points. Results are shown in Table 3 across six datasets.
Verified (90%) DCI-Agent-CC not only outperforms its matched retrieval counterpart but also surpasses the strongest retrieval baseline overall, GPT-5 + Qwen3-Embedding-8B (71.7%), by +8.3 points.
Specific quantitative results are provided in paragraph 42: DCI-Agent-CC achieves 80.0% accuracy, surpassing GPT-5 + Qwen3-Embedding-8B (71.7%) by 8.3 points. The comparison is explicit and the math is verifiable (80.0 - 71.7 = 8.3).
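The BrowseComp-Plus deltas quoted above can be re-derived from the reported figures with a few lines of arithmetic. The helper names below are illustrative, and rounding to one decimal place is an assumption chosen to match how the report states its numbers.

```python
def point_gain(new, old):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def percent_reduction(old, new):
    """Relative reduction from `old` to `new`, in percent, one decimal."""
    return round(100 * (old - new) / old, 1)


# Same Sonnet 4.6 backbone: retrieval tool vs. DCI.
print(point_gain(80.0, 69.0))         # 11.0
print(percent_reduction(1440, 1016))  # 29.4
# DCI-Agent-CC vs. GPT-5 + Qwen3-Embedding-8B.
print(point_gain(80.0, 71.7))         # 8.3
```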
Verified (90%) DCI-Agent-Lite (GPT-5.4 nano) achieves 62.9% accuracy at a cost of only $93, remaining competitive with substantially stronger retrieval agents such as o3 + Qwen3-Embedding-8B (66.0%) while reducing cost by $647.
Specific quantitative results are provided in paragraph 42: DCI-Agent-Lite achieves 62.9% accuracy at $93 cost, compared to o3 + Qwen3-Embedding-8B at 66.0% with cost reduction of $647. The numbers are concrete and verifiable.
Verified (90%) DCI-Agent-CC attains 83.0% average accuracy on knowledge-intensive QA, exceeding the strongest baseline, ASearcher-Local-14B (52.3%), by 30.7 points, while DCI-Agent-Lite is also competitive at 68.0%.
Specific quantitative results are provided in paragraph 43 with exact numbers: DCI-Agent-CC at 83.0%, ASearcher-Local-14B at 52.3% (difference of 30.7 points), and DCI-Agent-Lite at 68.0%.
Verified (90%) Relative to ASearcher-Local-14B, DCI-Agent-CC improves by 30 points on HotpotQA, 26 on 2Wiki, and 50 on MuSiQue.
Specific quantitative results are provided in paragraph 43 with per-dataset breakdowns: 30 points on HotpotQA, 26 on 2Wiki, and 50 on MuSiQue relative to ASearcher-Local-14B.
Verified (90%) DCI-Agent-CC (68.5%) achieves the best NDCG@10 score on all six datasets, exceeding the strongest retrieval baseline, ReasonRank-32B (47.0%), by 21.5 points on average.
Specific quantitative results are provided in paragraph 44: DCI-Agent-CC achieves 68.5% average NDCG@10, best on all six datasets, exceeding ReasonRank-32B (47.0%) by 21.5 points.
Verified (90%) DCI-Agent-Lite ranks second overall with an average NDCG@10 of 56.7, still 9.7 points above the strongest retrieval baseline.
Specific quantitative results are provided in paragraph 44: DCI-Agent-Lite ranks second with 56.7 average NDCG@10, 9.7 points above ReasonRank-32B (47.0%).
Verified (90%) With the same Sonnet 4.6 backbone, DCI-Agent-CC correctly answers 176 BrowseComp-Plus questions that the matched retrieval agent misses, whereas only 76 show the reverse pattern.
Specific quantitative results from trajectory analysis are provided in paragraph 46: DCI-Agent-CC correctly answers 176 questions that the matched retrieval agent misses, while only 76 show the reverse pattern. This is a direct comparison with concre
Verified (90%) Among the 176 CC-win cases, only 34 contain no gold documents retrieved by the retrieval agent, meaning that 142 of 176 (81%) involve cases where the retrieval agent had already surfaced some or all gold evidence.
Specific quantitative analysis is provided in paragraph 46: among 176 CC-win cases, only 34 contain no gold documents retrieved by the retrieval agent, meaning 142 of 176 (81%) involve cases where retrieval agent had already surfaced some or all gold
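The win/loss breakdown above reduces to two subtractions and one ratio. The sketch below recomputes them from the three reported counts. The function and key names are illustrative, and rounding the share to a whole percent is an assumption matching the "81%" in the claim.

```python
def win_breakdown(cc_wins, retrieval_wins, cc_wins_without_gold):
    """Recompute derived quantities from the reported trajectory counts."""
    with_gold = cc_wins - cc_wins_without_gold
    return {
        "net_wins": cc_wins - retrieval_wins,
        "cc_wins_with_gold": with_gold,
        "share_with_gold_pct": round(100 * with_gold / cc_wins),
    }


print(win_breakdown(176, 76, 34))
# {'net_wins': 100, 'cc_wins_with_gold': 142, 'share_with_gold_pct': 81}
```

The 81% share is the figure the report relies on: in most CC-win cases the retrieval agent had already surfaced some or all of the gold evidence.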
Verified (90%) With only 'read + grep', restricting the agent to file inspection and exact or pattern-based search, the agent achieves 61% accuracy on BrowseComp-Plus, outperforming the retrieval-agent baseline using Qwen3-Embedding-8B (45%) by 16 points.
Specific quantitative results from controlled ablation are provided in paragraph 47: with only 'read + grep', the agent achieves 61% accuracy on BrowseComp-Plus, outperforming Qwen3-Embedding-8B baseline (45%) by 16 points.
Verified (85%) Enabling the bash command set adds a further 12-point gain, but at the cost of substantially higher tool usage, latency, and compute.
Specific quantitative results from controlled ablation are provided in paragraph 47: enabling the bash command set adds a further 12-point gain (implying roughly 73% from the 61% 'read + grep' result, although the exact figure is not stated). The paper n

... 36 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Limitations (author-stated)

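This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.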

Analysis time: 2026-05-09T07:11:20+00:00 · Data source: Paper Collector