CLAWBENCH evaluates AI agents on 153 real-world web tasks across 144 live platforms and reveals a large performance gap. Claude Sonnet 4.6 and GPT-5.4 achieve 65-75% on traditional benchmarks but only 33.3% and 6.5%, respectively, on CLAWBENCH, which indicates that frontier models struggle with production websites.
Core Question
How well can AI agents complete everyday online tasks on real production websites, compared to their performance on existing sandbox-based benchmarks?
Core Method
CLAWBENCH evaluates agents on 153 tasks across 144 live platforms using a Chrome extension that intercepts final submission requests while allowing full interaction with production websites. A five-layer recording infrastructure captures session replays, screenshots, HTTP traffic, agent messages, and browser actions, enabling an Agentic Evaluator to compare agent trajectories against human ground-truth references.
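The interception idea described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual extension: the `Recorder` and `Request` classes, the layer key names, the `submit_patterns` URL substrings, and the rule that only state-changing HTTP methods count as submissions are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five recording layers named in the summary; the key names are illustrative.
LAYERS = (
    "session_replay",
    "screenshots",
    "http_traffic",
    "agent_messages",
    "browser_actions",
)

# Assumption: only state-changing methods can be a "final submission".
STATE_CHANGING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


@dataclass
class Request:
    method: str
    url: str


@dataclass
class Recorder:
    # Hypothetical URL substrings that mark a task's final submission endpoint.
    submit_patterns: List[str]
    layers: Dict[str, list] = field(
        default_factory=lambda: {name: [] for name in LAYERS}
    )
    intercepted: List[Request] = field(default_factory=list)

    def is_final_submission(self, req: Request) -> bool:
        """True if the request looks like the task's final submission."""
        return req.method.upper() in STATE_CHANGING_METHODS and any(
            pattern in req.url for pattern in self.submit_patterns
        )

    def handle(self, req: Request) -> bool:
        """Log the request; return True if it may reach the site, False if held back."""
        self.layers["http_traffic"].append(req)
        if self.is_final_submission(req):
            self.intercepted.append(req)
            return False
        return True


# Usage: browsing passes through, the final submission is captured instead of sent.
rec = Recorder(submit_patterns=["/checkout/confirm"])
allowed = rec.handle(Request("GET", "https://shop.example/cart"))
blocked = rec.handle(Request("POST", "https://shop.example/checkout/confirm"))
```

Holding back only the final request is what lets an agent interact fully with a production site without causing real-world side effects; the captured request can then be compared against a human reference.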
Claim Verification
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code available for the benchmark or evaluation framework
- No data available for CLAWBENCH benchmark
- Missing hyperparameters: temperature, top-p, max tokens, and other generation parameters for all 7 models
- No random seeds specified for reproducibility
- Missing hardware specifications (GPU, CPU, memory)
- No specific container configuration details provided
- Missing Chrome version and complete browser configuration
- No API versions or endpoints specified for proprietary models (Claude, GPT, Gemini)
- Missing number of experimental runs and statistical analysis details
- No information about CLAWBENCH benchmark structure, tasks, or scenarios
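The gaps listed above could be tracked with a small run manifest. The following is a hypothetical sketch: `RunManifest`, its field names, and `missing_fields` are assumptions introduced for illustration, and the values in the usage lines are placeholders, not settings reported in the paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RunManifest:
    """One evaluation run's settings; None means 'not reported'."""
    model_api_version: Optional[str] = None
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None
    random_seed: Optional[int] = None
    hardware: Optional[str] = None
    container_image: Optional[str] = None
    chrome_version: Optional[str] = None
    num_runs: Optional[int] = None


def missing_fields(manifest: RunManifest) -> List[str]:
    """Return the names of all settings that were left unreported."""
    return [f.name for f in fields(manifest) if getattr(manifest, f.name) is None]


# Usage with placeholder values (not taken from the paper).
manifest = RunManifest(temperature=0.0, random_seed=42)
unreported = missing_fields(manifest)
```

Checking against `None` rather than truthiness matters here, since a temperature of `0.0` is a legitimate reported value.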
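Limitations (as stated by the authors)
The paper does not explicitly list any limitations.
This analysis was generated automatically by PDF 阅读助手 and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-12T13:06:05+00:00 · Data source: Paper Collector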