ClawBench: Can AI Agents Complete Everyday Online Tasks? - AI Paper In-Depth Analysis

TL;DR
CLAWBENCH evaluates AI agents on 153 real-world web tasks across 144 live platforms and reveals a dramatic performance gap. Claude Sonnet 4.6 and GPT-5.4 achieve 65-75% on traditional benchmarks but only 33.3% and 6.5%, respectively, on CLAWBENCH, which indicates that frontier models still struggle on production websites.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

Core Question

How well can AI agents complete everyday online tasks on real production websites, compared to their performance on existing sandbox-based benchmarks?

Core Method

CLAWBENCH evaluates agents on 153 tasks across 144 live platforms using a Chrome extension that intercepts final submission requests while allowing full interaction with production websites. A five-layer recording infrastructure captures session replays, screenshots, HTTP traffic, agent messages, and browser actions, enabling an Agentic Evaluator to compare agent trajectories against human ground-truth references.
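The interception idea described above can be illustrated with a minimal Python sketch. It is not the paper's Chrome extension: the `Request` and `SubmissionInterceptor` names, the URL patterns, and the `shop.example` host are all hypothetical, and the sketch only assumes that each task defines which request counts as its final submission. Every other request is allowed through, while the final one is blocked and kept for later evaluation.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Request:
    """A simplified outgoing HTTP request (hypothetical structure)."""
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    # Hypothetical (method, URL glob) pairs that mark a task's final submission.
    final_patterns: list
    # Blocked requests, retained so an evaluator can inspect them afterwards.
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> bool:
        """Return True if the request may reach the live site, False if blocked."""
        for method, pattern in self.final_patterns:
            if request.method.upper() == method and fnmatchcase(request.url, pattern):
                self.captured.append(request)
                return False
        return True


# Assumed example task: everything is allowed except the checkout POST.
interceptor = SubmissionInterceptor([("POST", "https://shop.example/checkout/*")])
browse = interceptor.handle(Request("GET", "https://shop.example/cart"))
submit = interceptor.handle(
    Request("POST", "https://shop.example/checkout/confirm", "order=1")
)
print(browse, submit, len(interceptor.captured))
```

Blocking only the final request is what lets an agent exercise the full production workflow without causing real-world side effects such as placing an order.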

Claim Verification

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code available for the benchmark or evaluation framework
No data available for CLAWBENCH benchmark
Missing hyperparameters: temperature, top-p, max tokens, and other generation parameters for all 7 models
No random seeds specified for reproducibility
Missing hardware specifications (GPU, CPU, memory)
No specific container configuration details provided
Missing Chrome version and complete browser configuration
No API versions or endpoints specified for proprietary models (Claude, GPT, Gemini)
Missing number of experimental runs and statistical analysis details
No information about CLAWBENCH benchmark structure, tasks, or scenarios
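As a rough illustration of what closing these gaps might involve, the sketch below records the per-run details that the list above reports as missing and flags any that remain unset. The `RunManifest` fields and all values are placeholders chosen for this example; none of them come from the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunManifest:
    """Hypothetical record of the settings needed to repeat one evaluation run."""
    model_id: str                    # exact API model identifier and version
    temperature: Optional[float]
    top_p: Optional[float]
    max_tokens: Optional[int]
    seed: Optional[int]
    num_runs: int
    browser_version: Optional[str]
    container_image: Optional[str]


def missing_fields(manifest: RunManifest) -> list:
    """Return the names of fields that were left unspecified (None)."""
    return sorted(name for name, value in asdict(manifest).items() if value is None)


# Placeholder values only, for illustration.
manifest = RunManifest(
    model_id="example-model-v0",
    temperature=0.0,
    top_p=None,
    max_tokens=4096,
    seed=1234,
    num_runs=3,
    browser_version=None,
    container_image=None,
)
print(json.dumps(asdict(manifest), indent=2))
print(missing_fields(manifest))
```

Publishing a manifest like this alongside each reported score would let a reader see at a glance which settings are still undocumented.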

Limitations (as stated by the authors)

The paper does not explicitly list any limitations.

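This analysis was generated automatically by the PDF Reading Assistant. It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.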

Analysis time: 2026-04-12T13:06:05+00:00 · Data source: Paper Collector