Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents - AI 论文深度分析

TL;DR
Claw-Eval introduces a trustworthy evaluation framework for autonomous LLM agents through full-trajectory auditing and multi-dimensional scoring across 300 tasks.

已证实

证据不足

无法验证

N/A

可复现性

置信度

84%

核心问题

How can autonomous LLM agents be evaluated in a trustworthy manner that captures actual behavior rather than reported actions, and accounts for safety, robustness, and cross-modal performance?

核心方法

{'approach': 'Claw-Eval evaluates 14 frontier models on 300 human-verified tasks using a three-phase execution lifecycle with three independent evidence channels (execution traces, server-side audit logs, environment snapshots). The framework scores agents across Completion, Safety, and Robustness dimensions, running multiple trials per task to report Average Score, Pass@k, and Pass^k metrics.', 'key_components': ['14 frontier models from seven model families are evaluated under identical scaffold and tool configurations.', 'All models are evaluated on General (161 tasks) and Multi-turn dialogue (38 tasks).', 'Only 9 models with visual input support are evaluated on Multimodal tasks (101 tasks).', 'Performance differences reflect model capability rather than integration artifacts.'], 'section_ids': ['sec_11']}

论点验证

已证实 (85%) We introduce Claw-Eval, an end-to-end evaluation suite that addresses all three gaps within a unified platform, organized around three corresponding design principles.
The paper provides substantial detail on Claw-Eval's design across multiple sections (p_4, p_13-35). The three gaps (G1-G3) are clearly identified in p_3, and p_4 explicitly describes three corresponding design principles. The framework architecture,

已证实 (88%) We introduce Claw-Eval, an end-to-end evaluation suite of 300 humanverified tasks across 9 categories, featuring full-trajectory auditing, unified cross-modal coverage, and integrated multi-dimensional scoring along Completion, Safety, and Robustness, with 2,159 independently verifiable rubric items.
All specific numbers are documented in the paper: 300 tasks (p_4, p_17), 9 categories (p_4, p_17), 3 groups with exact counts (p_17: General 161, Multimodal 101, Multi-turn 38), and 2,159 rubric items with mean calculation (p_29). The human-verified

已证实 (82%) Every agent action is recorded through three independent evidence channels (execution traces, serviceside audit logs, and post-execution environment snapshots), enabling grading that verifies what the agent actually did rather than what it reported having done.
The three evidence channels are explicitly described across p_13-16: execution traces (p_16), server-side audit logs (p_13, p_15), and post-execution environment snapshots (p_13, p_16). The design is clearly specified, though the actual implementatio

已证实 (85%) The scoring protocol evaluates Completion, Safety, and Robustness as coupled dimensions within the same task execution.
p_23 explicitly states the three orthogonal dimensions (Completion, Safety, Robustness) and p_4 confirms they are evaluated 'as coupled dimensions within the same task execution.' The scoring formula in p_23-24 shows how they combine.

已证实 (85%) Safety constraints are embedded within normal workflow tasks and Robustness is measured through controlled error-rate injection that simulates realistic deployment perturbations.
p_18 states '43 tasks further embed safety constraints' within normal workflow tasks. p_22 and p_27-28 describe controlled error injection for robustness measurement. Both design choices are clearly specified with concrete implementation details.

已证实 (80%) A single declarative task schema accommodates 300 human-verified tasks spanning 9 categories across 3 groups (General service orchestration, Multimodal perception and generation, and Multi-turn professional dialogue).
p_17 confirms 300 tasks, 9 categories, and 3 groups with specific counts. p_21 mentions 'declarative task schema' that accommodates all task types. However, the schema itself is not shown, so the 'single declarative schema' claim cannot be fully veri

已证实 (88%) Every task is run for k independent trials, and we report three complementary metrics: Average Score (overall capability), Pass@k (capability ceiling), and Pass k (reliability floor), providing a complete picture of deployable capability.
p_5 and p_30-34 explicitly describe the multi-trial protocol with three metrics: Average Score, Pass@k, and Pass k. The formulas are provided in p_30-34. The paper uses k=3 throughout experiments (p_38).

已证实 (90%) A vanilla LLM judge operating without audit logs or environment snapshots misses 44% of safety violations and 13% of robustness failures that our hybrid grading pipeline catches, demonstrating that evaluation design choices materially affect benchmark conclusions.
p_43 provides specific numbers: 12 out of 27 safety violations missed (44%) and 15 out of 118 robustness issues missed (13%). These are concrete experimental results with clear methodology described in p_41-42.

证据不足 (60%) Model rankings shift substantially across task groups and modalities, with no single model dominating all domains.
p_39 states rankings shift across modalities, but the specific rankings and numerical data are not shown in the provided text. The claim about 'no single model dominating all domains' is stated but not substantiated with visible comparative tables or

已证实 (88%) Robustness under perturbation constitutes an independent capability axis, with Pass@3 remaining stable under error injection while Pass 3 dropping by up to 24 percentage points, and is not predictable from nominal-condition performance.
p_46 provides specific numbers: Pass@3 remains stable (Claude-Opus-4.6 drops only 3.7%, GLM-5-Turbo rises 1.2%), while Pass^3 drops sharply (Gemini-3.1-Pro loses 24.2%, Claude-Opus-4.6 loses 14.3%, GLM-5-Turbo loses 12.4%). The 'up to 24 percentage p

已证实 (90%) Multi-turn dialogue success hinges on the quality of the agent's questioning strategy (r=0.87) rather than conversational length (r=0.07).
p_48 reports r=0.07 for round count correlation with Pass^3, and p_49 reports r=0.87 for question precision correlation with Pass^3. These are specific statistical results with R² values also provided (R²<0.01 and R²=0.76 respectively).

已证实 (80%) The framework is built on a core premise: trustworthy agent evaluation requires grounding every score in evidence of what the agent actually did, rather than what it reported having done.
p_13 explicitly states this core premise. The design is elaborated throughout §3 with evidence channels and grading methodology. However, this is a design philosophy claim rather than an empirically validated finding.

已证实 (82%) A strict temporal boundary separates execution from grading: no scoring script, reference answer, or verification utility exists inside the container while the agent is running, ensuring that all observed behavior reflects genuine problem-solving rather than any evaluation-aware adaptation.
p_14 explicitly describes the temporal firewall: 'A strict temporal boundary separates execution from grading: no scoring script, reference answer, or verification utility exists inside the container while the agent is running.' The implementation is

已证实 (82%) Tasks may declare mock services such as CRM platforms, email gateways, scheduling systems, and knowledge bases, which are deployed outside the sandbox and accessible only through designated tool-call interfaces.
p_15 lists mock services (CRM platforms, email gateways, scheduling systems, knowledge bases) and specifies they are 'deployed outside the sandbox and accessible only through designated tool-call interfaces.'

证据不足 (60%) The system layer offers eleven built-in tools spanning code execution, file operations, codebase search, web interaction, and multimodal media processing, covering the core action space that real-world agent tasks demand.
p_16 mentions 'eleven built-in tools spanning code execution, file operations, codebase search, web interaction, and multimodal media processing' but Table 2 which allegedly details these tools is not visible in the provided text. The exact tools can

已证实 (82%) Throughout execution, the complete agentic context is recorded in a structured execution trace, maintained outside the sandbox and invisible to the agent, that will serve as a primary evidence source during grading.
p_16 explicitly states: 'the complete agentic context is recorded in a structured execution trace, maintained outside the sandbox and invisible to the agent, that will serve as a primary evidence source during grading.'

已证实 (85%) 43 tasks further embed safety constraints: actions the agent must not take, such as sending an email during a triage-only task or exposing credentials, testing whether agents respect policy boundaries under genuine task-completion pressure.
p_18 states '43 tasks further embed safety constraints' with examples (sending email during triage-only task, exposing credentials). The specific count and examples are provided.

已证实 (82%) All multimodal tasks work with media assets injected as workspace files using the built-in sandbox tools; no mock services are declared.
p_19 explicitly states: 'All multimodal tasks work with media assets injected as workspace files using the built-in sandbox tools; no mock services are declared.'

已证实 (85%) Claw-Eval evaluates three orthogonal dimensions: Completion measures the degree to which the agent fulfilled the task objective; Safety assesses whether the agent respected policy constraints throughout execution; and Robustness quantifies how effectively the agent recovers from transient environmental failures.
p_23 provides explicit definitions for all three dimensions: Completion, Safety, and Robustness with their respective measurement approaches.

已证实 (88%) We set α=0.8 and β=0.2 throughout our experiments, reflecting that completion is the primary objective while robustness remains an important secondary signal.
p_24 explicitly states: 'We set α=0.8 and β=0.2 throughout our experiments, reflecting that completion is the primary objective while robustness remains an important secondary signal.'

... 共 45 个论点

可复现性评估

较低可复现性 (0%)

缺失的复现细节

No code available - the evaluation framework, scaffolds, and tool configurations are not publicly released
No data available - the 161 General tasks, 38 Multi-turn dialogue tasks, and 101 Multimodal tasks are not provided
Task definitions and specifications are completely missing - no description of what each task entails
Rubric items and scoring criteria are not documented - unclear how tasks are evaluated and scored
Scaffold and tool configurations mentioned but not detailed - specific tools and scaffolding setup unknown
Docker sandbox configuration details not provided
LLM judge prompts not specified - exact prompts used for Gemini-3-Flash and Claude Opus-4.6 as judges are unknown
Random seeds not reported for reproducibility of the 3 independent trials
Hardware/environment specifications not mentioned
API access details and model version specifics unclear

局限性（作者自述）

论文中未明确列出局限性。

本分析由 PDF 阅读助手自动生成，仅供参考，不构成学术评审意见。验证结论和可复现性评估基于论文文本自动分析，可能存在偏差。原始论文请参阅 arXiv。

分析时间：2026-04-19T07:20:47+00:00 · 数据来源：Paper Collector