Claw-Eval introduces a trustworthy evaluation framework for autonomous LLM agents through full-trajectory auditing and multi-dimensional scoring across 300 tasks.
Core question
How can autonomous LLM agents be evaluated in a trustworthy manner that captures actual behavior rather than reported actions, and accounts for safety, robustness, and cross-modal performance?
Core method
Approach: Claw-Eval evaluates 14 frontier models on 300 human-verified tasks using a three-phase execution lifecycle with three independent evidence channels (execution traces, server-side audit logs, environment snapshots). The framework scores agents across the Completion, Safety, and Robustness dimensions and runs multiple trials per task to report the Average Score, Pass@k, and Pass^k metrics.
Key components:
- 14 frontier models from seven model families are evaluated under identical scaffold and tool configurations.
- All models are evaluated on General (161 tasks) and Multi-turn dialogue (38 tasks).
- Only the 9 models with visual input support are evaluated on Multimodal tasks (101 tasks).
- Performance differences reflect model capability rather than integration artifacts.
Section IDs: sec_11
Claim verification
The paper provides substantial detail on Claw-Eval's design across multiple sections (p_4, p_13-35). The three gaps (G1-G3) are clearly identified in p_3, and p_4 explicitly describes three corresponding design principles. The framework architecture,
All specific numbers are documented in the paper: 300 tasks (p_4, p_17), 9 categories (p_4, p_17), 3 groups with exact counts (p_17: General 161, Multimodal 101, Multi-turn 38), and 2,159 rubric items with mean calculation (p_29). The human-verified
The three evidence channels are explicitly described across p_13-16: execution traces (p_16), server-side audit logs (p_13, p_15), and post-execution environment snapshots (p_13, p_16). The design is clearly specified, though the actual implementatio
p_23 explicitly states the three orthogonal dimensions (Completion, Safety, Robustness) and p_4 confirms they are evaluated 'as coupled dimensions within the same task execution.' The scoring formula in p_23-24 shows how they combine.
p_18 states '43 tasks further embed safety constraints' within normal workflow tasks. p_22 and p_27-28 describe controlled error injection for robustness measurement. Both design choices are clearly specified with concrete implementation details.
p_17 confirms 300 tasks, 9 categories, and 3 groups with specific counts. p_21 mentions 'declarative task schema' that accommodates all task types. However, the schema itself is not shown, so the 'single declarative schema' claim cannot be fully veri
p_5 and p_30-34 explicitly describe the multi-trial protocol with three metrics: Average Score, Pass@k, and Pass^k. The formulas are provided in p_30-34. The paper uses k=3 throughout its experiments (p_38).
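The three metrics can be illustrated with a minimal Python sketch. It assumes the common reading of the two pass metrics: a task counts toward Pass@k if at least one of its k trials passes, and toward Pass^k only if all k trials pass. The paper's own formulas (p_30-34) are not reproduced in this analysis, so the function name `trial_metrics`, the `pass_threshold` cut-off, and the toy scores are all hypothetical.

```python
from statistics import mean
from typing import Dict, List


def trial_metrics(trial_scores: Dict[str, List[float]],
                  pass_threshold: float) -> Dict[str, float]:
    """Aggregate per-task trial scores into Average Score, Pass@k and Pass^k.

    trial_scores maps a task id to its k trial scores in [0, 1].
    pass_threshold is an assumed cut-off for counting a trial as passed;
    it is not taken from the paper.
    """
    if not trial_scores:
        raise ValueError("no tasks to aggregate")

    per_task_average = []
    solved_at_least_once = 0   # numerator of Pass@k
    solved_every_time = 0      # numerator of Pass^k

    for task_id, scores in trial_scores.items():
        if not scores:
            raise ValueError(f"task {task_id!r} has no trials")
        passed = [score >= pass_threshold for score in scores]
        per_task_average.append(mean(scores))
        solved_at_least_once += int(any(passed))
        solved_every_time += int(all(passed))

    n_tasks = len(trial_scores)
    return {
        "average_score": mean(per_task_average),
        "pass_at_k": solved_at_least_once / n_tasks,
        "pass_hat_k": solved_every_time / n_tasks,
    }


# Toy data with k=3 trials per task (values are illustrative only).
toy_scores = {
    "task_a": [0.90, 0.80, 0.95],  # passes every trial
    "task_b": [0.90, 0.20, 0.10],  # passes once
    "task_c": [0.10, 0.30, 0.20],  # never passes
}
toy_metrics = trial_metrics(toy_scores, pass_threshold=0.75)
print(toy_metrics)
```

On the toy data, two of three tasks pass at least once while only one passes in all three trials, which is the kind of gap between Pass@3 and Pass^3 that the consistency claims in this report refer to.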
p_43 provides specific numbers: 12 out of 27 safety violations missed (44%) and 15 out of 118 robustness issues missed (13%). These are concrete experimental results with clear methodology described in p_41-42.
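The two percentages follow directly from the reported counts. A short Python check, where `miss_rate` is a hypothetical helper introduced only for this illustration:

```python
def miss_rate(missed: int, total: int) -> float:
    """Fraction of issues that went undetected."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= missed <= total:
        raise ValueError("missed must lie between 0 and total")
    return missed / total


# Counts as reported for p_43.
safety_miss = miss_rate(12, 27)        # safety violations missed
robustness_miss = miss_rate(15, 118)   # robustness issues missed

print(f"safety: {safety_miss:.1%}")          # 44.4%
print(f"robustness: {robustness_miss:.1%}")  # 12.7%
```

Rounded to whole percentages, these give the 44% and 13% quoted above.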
p_39 states rankings shift across modalities, but the specific rankings and numerical data are not shown in the provided text. The claim about 'no single model dominating all domains' is stated but not substantiated with visible comparative tables or
p_46 provides specific numbers: Pass@3 remains stable (Claude-Opus-4.6 drops only 3.7%, GLM-5-Turbo rises 1.2%), while Pass^3 drops sharply (Gemini-3.1-Pro loses 24.2%, Claude-Opus-4.6 loses 14.3%, GLM-5-Turbo loses 12.4%). The 'up to 24 percentage p
p_48 reports r=0.07 for round count correlation with Pass^3, and p_49 reports r=0.87 for question precision correlation with Pass^3. These are specific statistical results with R² values also provided (R²<0.01 and R²=0.76 respectively).
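The reported R² values are consistent with the reported correlations, assuming each R² comes from a simple one-predictor linear fit, where R² equals the square of Pearson's r:

```latex
\[
R^2 = r^2:\qquad
0.07^2 = 0.0049 < 0.01,
\qquad
0.87^2 = 0.7569 \approx 0.76 .
\]
```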
p_13 explicitly states this core premise. The design is elaborated throughout §3 with evidence channels and grading methodology. However, this is a design philosophy claim rather than an empirically validated finding.
p_14 explicitly describes the temporal firewall: 'A strict temporal boundary separates execution from grading: no scoring script, reference answer, or verification utility exists inside the container while the agent is running.' The implementation is
p_15 lists mock services (CRM platforms, email gateways, scheduling systems, knowledge bases) and specifies they are 'deployed outside the sandbox and accessible only through designated tool-call interfaces.'
p_16 mentions 'eleven built-in tools spanning code execution, file operations, codebase search, web interaction, and multimodal media processing' but Table 2 which allegedly details these tools is not visible in the provided text. The exact tools can
p_16 explicitly states: 'the complete agentic context is recorded in a structured execution trace, maintained outside the sandbox and invisible to the agent, that will serve as a primary evidence source during grading.'
p_18 states '43 tasks further embed safety constraints' with examples (sending email during triage-only task, exposing credentials). The specific count and examples are provided.
p_19 explicitly states: 'All multimodal tasks work with media assets injected as workspace files using the built-in sandbox tools; no mock services are declared.'
p_23 provides explicit definitions for all three dimensions: Completion, Safety, and Robustness with their respective measurement approaches.
p_24 explicitly states: 'We set α=0.8 and β=0.2 throughout our experiments, reflecting that completion is the primary objective while robustness remains an important secondary signal.'
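How the weights might enter a per-task score can be sketched as follows. Only the values α=0.8 and β=0.2 come from the paper; the exact formula in p_23-24 is not reproduced in this analysis, so the weighted sum and, in particular, the treatment of Safety as a multiplicative factor are assumptions made for illustration. The name `task_score` is hypothetical.

```python
ALPHA = 0.8  # weight on Completion (value reported in p_24)
BETA = 0.2   # weight on Robustness (value reported in p_24)


def task_score(completion: float, robustness: float, safety: float = 1.0,
               alpha: float = ALPHA, beta: float = BETA) -> float:
    """Combine the three dimension scores, each expected in [0, 1].

    Assumption: Safety scales the weighted sum of Completion and
    Robustness, so a full safety violation (safety=0) zeroes the score.
    The paper may combine the dimensions differently.
    """
    for name, value in (("completion", completion),
                        ("robustness", robustness),
                        ("safety", safety)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return safety * (alpha * completion + beta * robustness)


# Full completion with half robustness and no safety violation.
example_score = task_score(completion=1.0, robustness=0.5)
print(example_score)
```

With these weights, completion accounts for at most 0.8 of the score and robustness for at most 0.2, which matches the stated intent that completion is primary and robustness secondary.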
... 45 claims in total
Reproducibility assessment
Low reproducibility (0%)
Missing reproduction details
- No code available - the evaluation framework, scaffolds, and tool configurations are not publicly released
- No data available - the 161 General tasks, 38 Multi-turn dialogue tasks, and 101 Multimodal tasks are not provided
- Task definitions and specifications are completely missing - no description of what each task entails
- Rubric items and scoring criteria are not documented - unclear how tasks are evaluated and scored
- Scaffold and tool configurations mentioned but not detailed - specific tools and scaffolding setup unknown
- Docker sandbox configuration details not provided
- LLM judge prompts not specified - exact prompts used for Gemini-3-Flash and Claude-Opus-4.6 as judges are unknown
- Random seeds not reported for reproducibility of the 3 independent trials
- Hardware/environment specifications not mentioned
- API access details and model version specifics unclear
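Limitations (as stated by the authors)
The paper does not explicitly list any limitations.
This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-19T07:20:47+00:00 · Data source: Paper Collector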