AcademiClaw: When Students Set Challenges for AI Agents - In-Depth AI Paper Analysis

TL;DR
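AcademiClaw is a bilingual benchmark of 80 academic-level agent tasks drawn from real university workflows, which the authors describe as the first of its kind within the OpenClaw ecosystem. Of six frontier models evaluated, the best reaches only a 55% pass rate (score ≥ 75 out of 100), and over 22% of tasks show score swings of up to 90 points across models.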

Confirmed

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

89%

Core Question

How well do current frontier AI agents perform on complex, academic-level tasks compared to the assistant-level tasks that existing benchmarks evaluate?

Core Method

Approach: The authors constructed AcademiClaw through bottom-up collection from undergraduate students who contributed real academic problems that current AI agents failed to solve. From 230 candidates, expert review distilled 80 bilingual tasks across 25+ domains. Each task runs in an isolated Docker container and is scored with a multi-dimensional rubric combining six verification techniques, and six frontier models were evaluated under identical conditions.

Key components:
Four candidate judges were compared: GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.5, and GLM-5.
Sonnet 4.5 and GPT-5.2 achieved the highest correlation with human annotations (r = 0.93 and 0.91 respectively).
GPT-5.2 was selected for cost efficiency and absence of self-evaluation bias.
GPT-5.2 is excluded from the evaluated model set (GPT-5.4 is evaluated instead).
Pairwise model correlations range from 0.275 to 0.729, with a mean of 0.54.
The wide spread indicates distinct capability profiles across models.
The least correlated pairs excel on complementary subsets of tasks.
Highly correlated pairs may reflect overlapping training data or similar fine-tuning pipelines.
All per-model token-score correlations are statistically indistinguishable from zero, confirming the null result is not an artifact of averaging heterogeneous trends.
The two highest-token-spending models (Gemini 3.1 Pro and MiniMax M2.7) fail to convert token expenditure into score gains, indicating agents lack an effective stopping criterion.

Section IDs: sec_10, sec_15, sec_31

Claim Verification

Confirmed (95%) We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks designed to bridge this gap.
The paper provides extensive concrete evidence for AcademiClaw's construction: 80 tasks (p_8), bilingual composition (49 English, 31 Chinese in p_8, p_11), Docker containerization (p_13, p_49), multi-dimensional rubrics (p_10, p_14), and detailed tas

Insufficient evidence (60%) We construct AcademiClaw, the first academic-level benchmark within the OpenClaw ecosystem, comprising 80 bilingual tasks sourced directly from university students' real academic workflows across 25+ domains, including 16 GPU-intensive tasks absent from all prior agent benchmarks.
The paper provides strong evidence for most components: 80 bilingual tasks (p_8), 25+ domains (p_10), 16 GPU-intensive tasks (p_50), and student sourcing (p_8, p_25). However, the claims 'first academic-level benchmark within the OpenClaw ecosystem'

Unverifiable (90%) To our knowledge, AcademiClaw is also the first agent benchmark whose tasks originate entirely from university students rather than researchers or annotators.
This is a novelty claim about being the 'first' benchmark with a particular property. Verifying this would require comprehensive knowledge of all existing agent benchmarks and their task sourcing methodologies. The paper asserts this with 'to our kno

Confirmed (95%) Rather than having researchers or annotators design tasks top-down, we adopt a bottom-up collection strategy: undergraduate students contribute problems from their real academic workflows (course assignments, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively.
The paper provides detailed evidence for the bottom-up collection strategy: p_8 describes undergraduate students contributing problems from real academic workflows, and p_25 confirms contributors were undergraduate students enrolled in an LLM Technol

Confirmed (95%) Each task executes in an isolated Docker container and is scored by a multi-dimensional rubric combining six complementary verification techniques: deterministic checks, code execution, LLM-as-judge, vision LLM assessment, end-to-end browser testing, and structured-output validation.
The paper provides concrete evidence for both components: Docker containerization is detailed in p_13 and p_49-50, and the six verification techniques are explicitly listed in p_10 and p_14 (pattern matching, code execution, LLM-as-Judge, vision LLM

Confirmed (95%) We evaluate six frontier models (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Qwen3.5-397B, and MiniMax M2.7) under identical conditions via the OpenClaw agent framework.
The paper explicitly lists all six models in p_4 and p_42, and provides detailed evidence for identical evaluation conditions: p_15 states 'Every model-task pair runs inside the same Docker sandbox', p_45 describes the single pinned OpenClaw build, a

Confirmed (85%) Even the best-performing model achieves only a 55% pass rate (score ≥ 75 out of 100), confirming that academic-level tasks pose a substantial challenge to current frontier agents.
The paper states in p_4 that 'Even the best-performing model achieves only a 55% pass rate (score ≥ 75 out of 100)'. This is a self-reported result without external validation. The pass threshold of 75 is confirmed in p_14. While the specific number
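The pass-rate definition quoted above (a task passes when its score is at least 75 out of 100) can be sketched as follows. The function name and the example scores are hypothetical and for illustration only; they are not taken from the paper or its code.

```python
PASS_THRESHOLD = 75  # pass threshold quoted above: score >= 75 out of 100


def pass_rate(scores, threshold=PASS_THRESHOLD):
    """Return the fraction of task scores at or above the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical scores for one model on eight tasks (not data from the paper).
example_scores = [92, 75, 74.9, 40, 88, 10, 81, 60]
example_rate = pass_rate(example_scores)
print(f"{example_rate:.1%}")
```

Note that the comparison is inclusive, so a score of exactly 75 counts as a pass while 74.9 does not.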

Insufficient evidence (60%) Over 22% of tasks exhibit capability boundaries where scores swing by up to 90 points across models on the same task.
The claim is stated in p_4 ('over 22% of tasks exhibit capability boundaries where scores swing by up to 90 points'), but the paper does not provide a breakdown showing which specific tasks meet this criterion or how the 22% figure was calculated. Th
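Because the calculation behind the 22% figure is not shown, the following is only one plausible reading, sketched under stated assumptions: a task's swing is the gap between the best and worst model score on that task, and a task counts as a capability boundary when that gap meets some cutoff. The cutoff `min_swing`, the task names, and the scores are all hypothetical.

```python
def score_swing(model_scores):
    """Gap between the best and worst model score on a single task."""
    values = list(model_scores.values())
    return max(values) - min(values)


def boundary_fraction(scores_by_task, min_swing):
    """Return (fraction, task names) for tasks whose swing is >= min_swing.

    min_swing is an assumed cutoff; the analysis above does not state
    the criterion the paper actually used.
    """
    flagged = [
        task for task, scores in scores_by_task.items()
        if score_swing(scores) >= min_swing
    ]
    return len(flagged) / len(scores_by_task), flagged


# Hypothetical scores, not data from the paper.
scores_by_task = {
    "task_a": {"model_x": 95, "model_y": 5, "model_z": 60},   # swing 90
    "task_b": {"model_x": 70, "model_y": 65, "model_z": 72},  # swing 7
    "task_c": {"model_x": 30, "model_y": 80, "model_z": 55},  # swing 50
    "task_d": {"model_x": 88, "model_y": 90, "model_z": 85},  # swing 5
}
fraction, flagged = boundary_fraction(scores_by_task, min_swing=50)
print(fraction, flagged)
```

Publishing the per-task swings alongside the chosen cutoff would let readers recompute the headline percentage directly.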

Insufficient evidence (55%) Agents handle generative tasks well but struggle systematically with formal reasoning, with olympiad-level problems remaining universally unsolved.
The claim has mixed evidence. P_4 states 'olympiad-level problems remaining universally unsolved' and p_53 shows the IOL linguistics task has mean score 17.3 (very low). However, p_53 confusingly states 'The agent solves all five problems from the 22

Confirmed (90%) Token consumption varies by over 5× across models yet shows near-zero correlation with quality (r = -0.03), indicating that reasoning depth rather than computational effort drives performance.
The paper provides specific quantitative evidence: p_4 and p_20 state the Pearson correlation is r = -0.03 with p = 0.49. P_19 documents Gemini consuming 2,857K tokens vs GPT-5.4's 525K (5.4× difference). P_20 confirms 'token consumption varies by ov
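The two quantities in this claim can be checked with a short sketch: the token ratio follows from the figures quoted above (2,857K vs 525K tokens), and the Pearson coefficient is computed from paired token and score values. The paired data below is hypothetical and chosen so that the correlation is zero; it is not the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


# Ratio of the token figures quoted above (in thousands of tokens).
token_ratio = 2857 / 525

# Hypothetical (tokens in thousands, score) pairs with no linear relationship.
tokens = [100, 200, 300, 400]
scores = [80, 60, 60, 80]
r = pearson_r(tokens, scores)

print(f"token ratio: {token_ratio:.1f}x, r = {r:.2f}")
```

A coefficient near zero only rules out a linear relationship between tokens and score; it does not by itself show what does drive performance.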

Confirmed (85%) We further identify three distinct behavioral phenotypes (read-first, execute-first, and minimalist) that differ markedly in efficiency and safety profiles.
P_4 mentions three phenotypes, and p_19 provides detailed behavioral analysis with specific metrics: read-first (Claude models with Read% 45-47%), execute-first (Gemini with Exec% 74.3%, MiniMax with Exec% 65.9%), and minimalist (GPT-5.4 with fewest

Confirmed (95%) Each contributor was required to have previously attempted the problem with at least one mainstream AI agent and confirmed that the agent either failed outright or required extensive multi-turn interaction to produce an acceptable solution.
P_8 explicitly states 'each contributor was required to have previously attempted the problem with at least one mainstream AI agent and confirmed that the agent either failed outright or required extensive multi-turn interaction to produce an accepta

Confirmed (95%) This process yielded 230 candidate tasks.
P_8 explicitly states 'This process yielded 230 candidate tasks.' This is a straightforward factual claim about the data collection process that is clearly documented.

Confirmed (95%) This two-stage process (student contribution followed by expert curation) distilled the initial 230 candidates into a final set of 80 high-quality tasks (49 English, 31 Chinese).
P_8 provides the complete funnel: 'distilled the initial 230 candidates into a final set of 80 high-quality tasks (49 English, 31 Chinese).' P_30 confirms 'Of the 230 raw candidate submissions collected, 150 were removed during two expert-review roun
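The funnel figures quoted above are mutually consistent, which a few lines of arithmetic confirm. The variable names are illustrative; the numbers are those stated in the claim and its evidence.

```python
candidates = 230             # raw candidate submissions
removed = 150                # removed during expert review
final_en, final_zh = 49, 31  # final English and Chinese task counts

final_total = final_en + final_zh

# Both routes to the final count should agree.
assert candidates - removed == final_total == 80

retention = final_total / candidates
print(f"retained {final_total} of {candidates} ({retention:.1%})")
```

Roughly one in three submitted candidates survived curation.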

Confirmed (90%) Agents invoke an average of 33 tool calls per task (up to 136 for the most complex ones), with a mean execution time of 11.7 minutes and a maximum exceeding 40 minutes, reflecting extended chains of reading, coding, debugging, and verification.
P_9 provides specific quantitative metrics: 'agents invoke an average of 33 tool calls per task (up to 136 for the most complex ones), with a mean execution time of 11.7 minutes and a maximum exceeding 40 minutes.' These are concrete measurements fro

Confirmed (85%) The resulting 80 tasks span six primary categories and 25+ professional domains.
P_10 states 'The resulting 80 tasks span six primary categories and 25+ professional domains, as depicted in Figure 2b.' The claim is directly stated with reference to a figure. However, the figure itself is not provided in the text, so the exact cat

Confirmed (95%) Each task defines a custom rubric with 3-6 orthogonal scoring dimensions that sum to 100 points.
P_10 states 'Each task defines a custom rubric with 3-6 orthogonal scoring dimensions that sum to 100 points.' P_14 confirms 'Each task defines its own eval/rubric.py... with 3-6 orthogonal scoring dimensions that sum to 100 points.' P_31 provides a
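The rubric constraints stated here (3-6 dimensions whose maximum points sum to 100) can be expressed as a small validation sketch. The `Dimension` class, the function names, and the example dimensions are hypothetical; this is not the format of the paper's eval/rubric.py, which is not reproduced in this analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One scoring dimension of a rubric (hypothetical structure)."""
    name: str
    max_points: int


def validate_rubric(dimensions):
    """Check the two constraints quoted above: 3-6 dimensions, 100 points."""
    if not 3 <= len(dimensions) <= 6:
        raise ValueError("a rubric needs between 3 and 6 dimensions")
    total = sum(d.max_points for d in dimensions)
    if total != 100:
        raise ValueError(f"dimension points sum to {total}, expected 100")


def total_score(dimensions, earned):
    """Sum earned points after validating the rubric and each award."""
    validate_rubric(dimensions)
    score = 0
    for d in dimensions:
        points = earned.get(d.name, 0)  # unscored dimensions earn 0
        if not 0 <= points <= d.max_points:
            raise ValueError(f"{d.name}: {points} outside 0..{d.max_points}")
        score += points
    return score


# Hypothetical rubric and awards, for illustration only.
rubric = [
    Dimension("correctness", 40),
    Dimension("code_quality", 20),
    Dimension("visual_output", 25),
    Dimension("documentation", 15),
]
earned = {"correctness": 30, "code_quality": 15,
          "visual_output": 20, "documentation": 10}
score = total_score(rubric, earned)
print(score)
```

In this toy case the total is 75, which would sit exactly on the pass threshold of 75 used in the evaluation.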

Confirmed (95%) Rubric methods combine six complementary techniques (pattern matching, code execution, LLM-as-Judge, vision LLM assessment, end-to-end browser testing, and structured-output validation), allowing fine-grained diagnosis of where and why an agent falls short.
P_10 and p_14 explicitly list the six techniques: 'pattern matching, code execution, LLM-as-Judge, vision LLM assessment, end-to-end browser testing, and structured-output validation.' P_31-36 provides a detailed worked example (en_blackhole_visualiz

Confirmed (95%) The benchmark comprises 49 English and 31 Chinese tasks.
P_8 states the final set is '80 high-quality tasks (49 English, 31 Chinese).' P_11 confirms 'The benchmark comprises 49 English and 31 Chinese tasks.' The numbers are stated consistently in multiple locations.

Confirmed (95%) All tasks run inside isolated Docker containers organized in a two-layer image hierarchy: a base layer providing either a CPU or GPU environment, and a per-task layer adding task-specific dependencies.
P_13 describes 'isolated Docker containers organized in a two-layer image hierarchy: a base layer providing either a CPU or GPU environment, and a per-task layer adding task-specific dependencies.' P_49 provides additional details on the two base ima

... 54 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Complete task dataset: The 80 evaluation tasks are not described, nor is the task creation process detailed
Rubric specification: Full rubric dimensions and scoring criteria are not provided (only partial safety categories S4-S5 shown)
Judge decoding configuration: Referenced as 'specified in Appendix D' but not included in provided text
Human expert annotation protocol: Number of experts, selection criteria, and annotation guidelines not specified
Model API configurations: Temperature, top_p, max_tokens, and other generation parameters for all evaluated models not provided
System prompts: Mentioned as 'identical' but actual prompt text not included
Tool palette specification: Tools available to agents are referenced but not detailed
Random seeds: No mention of random seed setting for reproducibility
Hardware specifications: GPU details for CUDA-enabled worker pool not specified
Pilot study methodology: How the 25 stratified task outputs were selected is not explained

Limitations (as stated by the authors)

The current task set is sourced from CS undergraduates at a single university, and after rigorous filtering only 80 tasks remain; while these already span 25+ domains, collecting tasks from students across additional disciplines and institutions would further expand the benchmark's scale and representativeness.
All results are based on single-attempt evaluation; we plan to introduce multi-trial protocols such as Pass k (k = 3, 5) as well as retry mechanisms with feedback, which would provide more robust capability estimates.
Our model coverage is not yet comprehensive: we evaluate six frontier models but do not include recent releases (e.g., GPT-5.

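This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.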

Analysis time: 2026-05-05T07:12:53+00:00 · Data source: Paper Collector