TL;DR
CLAWBENCH evaluates AI agents on 153 real-world web tasks across 144 live platforms and reveals a dramatic performance gap: Claude Sonnet 4.6 and GPT-5.4 achieve 65-75% on traditional benchmarks but only 33.3% and 6.5%, respectively, on CLAWBENCH, showing that frontier models struggle on production websites.
Verified: 0
Insufficient evidence: 0
Unverifiable: 0
Reproducibility: N/A
Confidence: 0%

Core Question

How well can AI agents complete everyday online tasks on real production websites, compared to their performance on existing sandbox-based benchmarks?

Core Method

CLAWBENCH evaluates agents on 153 tasks across 144 live platforms using a Chrome extension that intercepts final submission requests while allowing full interaction with production websites. A five-layer recording infrastructure captures session replays, screenshots, HTTP traffic, agent messages, and browser actions, enabling an Agentic Evaluator to compare agent trajectories against human ground-truth references.
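
The interception idea described above can be sketched in a few lines. This is a minimal Python model of the decision logic only, not the paper's Chrome extension: the class names (`Request`, `Recorder`, `SubmissionInterceptor`), the method/host/path matching rule, and the example URLs are all assumptions made for illustration. Only the five layer names are taken from the summary.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse

# The five recording layers named in the method summary.
LAYERS = (
    "session_replay",
    "screenshots",
    "http_traffic",
    "agent_messages",
    "browser_actions",
)


@dataclass
class Request:
    """Hypothetical, simplified view of an outgoing HTTP request."""
    method: str
    url: str
    body: str = ""


@dataclass
class Recorder:
    """Hypothetical in-memory store with one event list per layer."""
    events: dict = field(
        default_factory=lambda: {layer: [] for layer in LAYERS}
    )

    def log(self, layer: str, event) -> None:
        if layer not in self.events:
            raise KeyError(f"unknown layer: {layer}")
        self.events[layer].append(event)


@dataclass
class SubmissionInterceptor:
    """Blocks the task's final submission and lets all other traffic pass.

    Matching on method + host + path is an assumed rule; the paper's
    actual matching criteria are not given in this summary.
    """
    submit_method: str
    submit_host: str
    submit_path: str
    recorder: Recorder
    captured: Optional[Request] = None

    def is_final_submission(self, req: Request) -> bool:
        parsed = urlparse(req.url)
        return (
            req.method.upper() == self.submit_method.upper()
            and parsed.hostname == self.submit_host
            and parsed.path == self.submit_path
        )

    def handle(self, req: Request) -> str:
        # Every request is recorded, whether or not it is blocked.
        self.recorder.log("http_traffic", req)
        if self.is_final_submission(req):
            self.captured = req  # kept for later evaluation
            return "blocked"
        return "allowed"


# Usage with made-up URLs: browsing passes, the final order is held back.
recorder = Recorder()
interceptor = SubmissionInterceptor(
    "POST", "shop.example.com", "/checkout/confirm", recorder
)
browse = interceptor.handle(Request("GET", "https://shop.example.com/cart"))
submit = interceptor.handle(
    Request("POST", "https://shop.example.com/checkout/confirm", "order=1")
)
```

Here `browse` is `"allowed"` and `submit` is `"blocked"`, while both requests end up in the `http_traffic` layer. Keeping the blocked request in `captured` mirrors the idea that the agent's final payload can still be compared against a human reference even though it never reaches the live site.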

Claim Verification

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Limitations (as stated by the authors)

The paper does not explicitly list any limitations.

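This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.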

Analysis time: 2026-04-12T13:06:05+00:00 · Data source: Paper Collector