AcademiClaw introduces what its authors describe as the first academic-level AI agent benchmark, with 80 bilingual tasks drawn from real university workflows. The best of the six frontier models evaluated achieves only a 55% pass rate (score ≥ 75 out of 100), and over 22% of tasks show score swings of up to 90 points.
Core question
How well do current frontier AI agents perform on complex, academic-level tasks compared to the assistant-level tasks that existing benchmarks evaluate?
Core method
The authors constructed AcademiClaw through bottom-up collection from undergraduate students, who contributed real academic problems that current AI agents failed to solve. From 230 candidates, expert review distilled 80 bilingual tasks across 25+ domains. Each task runs in an isolated Docker container and is scored with a multi-dimensional rubric combining six verification techniques. Six frontier models were evaluated under identical conditions.
Key components:
- Four candidate judges were compared: GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.5, and GLM-5.
- Sonnet 4.5 and GPT-5.2 achieved the highest correlation with human annotations (r = 0.93 and 0.91 respectively).
- GPT-5.2 was selected for cost efficiency and absence of self-evaluation bias.
- GPT-5.2 is excluded from the evaluated model set (GPT-5.4 is evaluated instead).
- Pairwise model correlations range from 0.275 to 0.729, with a mean of 0.54.
- The wide spread indicates distinct capability profiles across models.
- The least correlated pairs excel on complementary subsets of tasks.
- Highly correlated pairs may reflect overlapping training data or similar fine-tuning pipelines.
- All per-model token-score correlations are statistically indistinguishable from zero, confirming that the null result is not an artifact of averaging heterogeneous trends.
- The two highest-token-spending models (Gemini 3.1 Pro and MiniMax M2.7) fail to convert token expenditure into score gains, indicating that agents lack an effective stopping criterion.
Source sections: sec_10, sec_15, sec_31
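The pairwise model correlations mentioned above can be illustrated with a minimal sketch. It assumes the statistic is a plain Pearson correlation over per-task scores, which the summary does not state explicitly for the model-to-model comparison. The model names and scores below are hypothetical placeholders, not data from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """Map each unordered model pair to the correlation of their task scores."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical per-task scores (0-100), one entry per task, same task order.
example_scores = {
    "model_a": [90, 40, 75, 10, 60],
    "model_b": [85, 35, 80, 20, 55],
    "model_c": [30, 70, 45, 90, 50],
}

pairs = pairwise_correlations(example_scores)
mean_r = sum(pairs.values()) / len(pairs)

for pair, r in pairs.items():
    print(pair, round(r, 3))
print("mean r:", round(mean_r, 3))
```

A low or negative value for a pair means the two models tend to succeed on different tasks, which is the reading the summary gives for the least correlated pairs.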
Claim verification
The paper provides extensive concrete evidence for AcademiClaw's construction: 80 tasks (p_8), bilingual composition (49 English, 31 Chinese in p_8, p_11), Docker containerization (p_13, p_49), multi-dimensional rubrics (p_10, p_14), and detailed tas
The paper provides strong evidence for most components: 80 bilingual tasks (p_8), 25+ domains (p_10), 16 GPU-intensive tasks (p_50), and student sourcing (p_8, p_25). However, the claims 'first academic-level benchmark within the OpenClaw ecosystem'
This is a novelty claim about being the 'first' benchmark with a particular property. Verifying this would require comprehensive knowledge of all existing agent benchmarks and their task sourcing methodologies. The paper asserts this with 'to our kno
The paper provides detailed evidence for the bottom-up collection strategy: p_8 describes undergraduate students contributing problems from real academic workflows, and p_25 confirms contributors were undergraduate students enrolled in an LLM Technol
The paper provides concrete evidence for both components: Docker containerization is detailed in p_13 and p_49-50, and the six verification techniques are explicitly listed in p_10 and p_14 (pattern matching, code execution, LLM-as-Judge, vision LLM
The paper explicitly lists all six models in p_4 and p_42, and provides detailed evidence for identical evaluation conditions: p_15 states 'Every model-task pair runs inside the same Docker sandbox', p_45 describes the single pinned OpenClaw build, a
The paper states in p_4 that 'Even the best-performing model achieves only a 55% pass rate (score ≥ 75 out of 100)'. This is a self-reported result without external validation. The pass threshold of 75 is confirmed in p_14. While the specific number
The claim is stated in p_4 ('over 22% of tasks exhibit capability boundaries where scores swing by up to 90 points'), but the paper does not provide a breakdown showing which specific tasks meet this criterion or how the 22% figure was calculated. Th
The claim has mixed evidence. P_4 states 'olympiad-level problems remaining universally unsolved' and p_53 shows the IOL linguistics task has mean score 17.3 (very low). However, p_53 confusingly states 'The agent solves all five problems from the 22
The paper provides specific quantitative evidence: p_4 and p_20 state the Pearson correlation is r = -0.03 with p = 0.49. P_19 documents Gemini consuming 2,857K tokens vs GPT-5.4's 525K (5.4× difference). P_20 confirms 'token consumption varies by ov
P_4 mentions three phenotypes, and p_19 provides detailed behavioral analysis with specific metrics: read-first (Claude models with Read% 45-47%), execute-first (Gemini with Exec% 74.3%, MiniMax with Exec% 65.9%), and minimalist (GPT-5.4 with fewest
P_8 explicitly states 'each contributor was required to have previously attempted the problem with at least one mainstream AI agent and confirmed that the agent either failed outright or required extensive multi-turn interaction to produce an accepta
P_8 explicitly states 'This process yielded 230 candidate tasks.' This is a straightforward factual claim about the data collection process that is clearly documented.
P_8 provides the complete funnel: 'distilled the initial 230 candidates into a final set of 80 high-quality tasks (49 English, 31 Chinese).' P_30 confirms 'Of the 230 raw candidate submissions collected, 150 were removed during two expert-review roun
P_9 provides specific quantitative metrics: 'agents invoke an average of 33 tool calls per task (up to 136 for the most complex ones), with a mean execution time of 11.7 minutes and a maximum exceeding 40 minutes.' These are concrete measurements fro
P_10 states 'The resulting 80 tasks span six primary categories and 25+ professional domains, as depicted in Figure 2b.' The claim is directly stated with reference to a figure. However, the figure itself is not provided in the text, so the exact cat
P_10 states 'Each task defines a custom rubric with 3-6 orthogonal scoring dimensions that sum to 100 points.' P_14 confirms 'Each task defines its own eval/rubric.py... with 3-6 orthogonal scoring dimensions that sum to 100 points.' P_31 provides a
P_10 and p_14 explicitly list the six techniques: 'pattern matching, code execution, LLM-as-Judge, vision LLM assessment, end-to-end browser testing, and structured-output validation.' P_31-36 provides a detailed worked example (en_blackhole_visualiz
P_8 states the final set is '80 high-quality tasks (49 English, 31 Chinese).' P_11 confirms 'The benchmark comprises 49 English and 31 Chinese tasks.' The numbers are stated consistently in multiple locations.
P_13 describes 'isolated Docker containers organized in a two-layer image hierarchy: a base layer providing either a CPU or GPU environment, and a per-task layer adding task-specific dependencies.' P_49 provides additional details on the two base ima
... 54 claims in total
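Several of the claims above are simple numeric constraints: rubrics with 3-6 dimensions summing to 100 points, a pass threshold of 75, the 230-to-80 task funnel, and the token ratio between Gemini and GPT-5.4. The sketch below checks them. The function names, rubric dimension names, and example scores are hypothetical and are not taken from the paper's `eval/rubric.py`.

```python
PASS_THRESHOLD = 75  # pass threshold reported in the paper: score >= 75 out of 100


def validate_rubric(max_points):
    """Check the stated rubric shape: 3-6 dimensions whose maxima sum to 100."""
    if not 3 <= len(max_points) <= 6:
        raise ValueError("rubric must have between 3 and 6 dimensions")
    if sum(max_points.values()) != 100:
        raise ValueError("rubric dimensions must sum to 100 points")


def total_score(max_points, awarded):
    """Sum awarded points, rejecting values outside each dimension's range."""
    validate_rubric(max_points)
    total = 0
    for dimension, cap in max_points.items():
        points = awarded.get(dimension, 0)
        if not 0 <= points <= cap:
            raise ValueError(f"points for {dimension!r} out of range")
        total += points
    return total


def pass_rate(task_scores):
    """Fraction of tasks whose score meets the pass threshold."""
    return sum(score >= PASS_THRESHOLD for score in task_scores) / len(task_scores)


# Hypothetical rubric and awarded points for a single task.
rubric = {"correctness": 40, "code_quality": 20, "visual_output": 25, "report": 15}
awarded = {"correctness": 35, "code_quality": 15, "visual_output": 20, "report": 10}
score = total_score(rubric, awarded)
print(score, score >= PASS_THRESHOLD)

# Consistency of figures quoted in the claims above.
assert 230 - 150 == 80            # candidates minus removed equals final tasks
assert 49 + 31 == 80              # English plus Chinese tasks
assert round(2857 / 525, 1) == 5.4  # Gemini vs GPT-5.4 token ratio (K tokens)
```

A score of exactly 75 counts as a pass under this reading of "score ≥ 75", so `pass_rate([80, 74, 75, 20])` gives 0.5.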
Reproducibility assessment
Low reproducibility (0%)
Missing reproduction details
- Complete task dataset: The 80 evaluation tasks are not described, nor is the task creation process detailed
- Rubric specification: Full rubric dimensions and scoring criteria are not provided (only partial safety categories S4-S5 shown)
- Judge decoding configuration: Referenced as 'specified in Appendix D' but not included in provided text
- Human expert annotation protocol: Number of experts, selection criteria, and annotation guidelines not specified
- Model API configurations: Temperature, top_p, max_tokens, and other generation parameters for all evaluated models not provided
- System prompts: Mentioned as 'identical' but actual prompt text not included
- Tool palette specification: Tools available to agents are referenced but not detailed
- Random seeds: No mention of random seed setting for reproducibility
- Hardware specifications: GPU details for CUDA-enabled worker pool not specified
- Pilot study methodology: How the 25 stratified task outputs were selected is not explained
Limitations (as stated by the authors)
- The current task set is sourced from CS undergraduates at a single university, and after rigorous filtering only 80 tasks remain; while these already span 25+ domains, collecting tasks from students across additional disciplines and institutions would further expand the benchmark's scale and representativeness.
- All results are based on single-attempt evaluation; we plan to introduce multi-trial protocols such as Pass 𝑘 (𝑘 = 3, 5) as well as retry mechanisms with feedback, which would provide more robust capability estimates.
- Our model coverage is not yet comprehensive: we evaluate six frontier models but do not include recent releases (e.g., GPT-5.
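This analysis was generated automatically by PDF Reading Assistant. It is provided for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.
Analysis time: 2026-05-05T07:12:53+00:00 · Data source: Paper Collector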