OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models - AI Paper In-Depth Analysis

TL;DR
OCCUBENCH benchmarks AI agents on 100 real-world professional task scenarios using Language Environment Simulators. Testing 15 frontier models reveals that implicit faults are harder than explicit faults, that larger models outperform smaller ones within the same family, and that strong agents aren't necessarily strong simulators.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

88%

Core Question

How can AI agents be systematically evaluated on real-world professional tasks across diverse occupational domains where no existing benchmarks exist?

Core Method

Approach: The authors introduce Language Environment Simulators (LES) that use LLMs to simulate tool-response-level environment interactions for evaluation. The benchmark covers 100 professional task scenarios across 10 industries and 65 specialized domains, with 382 solvable task instances evaluated under clean conditions and three fault injection scenarios (explicit, implicit, and mixed faults).

Key components:
LLMs can serve as world models of the internet for web agent planning, as demonstrated by prior work.
WebWorld trains the first open-web simulator at scale on 1M+ interactions for agent training and inference-time search.
The LES approach uses LLMs to simulate tool-response-level environment interaction specifically for evaluation purposes.
LES supports stateful multi-step professional tasks with realistic action spaces across 100 scenarios and 65 specialized domains.
Larger models consistently outperform smaller counterparts within the same family.
Performance gaps range from 7.1% to 11.0% between large and small variants.
Claude 4.5 is an exception where Opus and Sonnet perform nearly identically.
The 4.5 generation's architectural improvements benefited both model sizes equally.
Each model exhibits a unique pattern of strengths and weaknesses across different industries.
Gemini 3.1 Pro excels in knowledge-intensive domains that reward factual accuracy and structured reasoning.

Section IDs: sec_3, sec_21, sec_27

Claim Verification

Verified (75%) We present OCCUBENCH, the first benchmark systematically evaluating AI agents on real-world professional tasks across 100 scenarios, 65 specialized domains, and 10 industry categories.
The paper provides concrete evidence of OCCUBENCH's scope: p_22 states 'OCCUBENCH covers 100 professional task scenarios across 10 industry categories and 65 specialized domains' and p_23 confirms '382 solvable task instances spanning all 100 scenari

Verified (80%) Our Language Environment Simulator approach occupies a distinct niche: using LLMs to simulate tool-response-level environment interaction for evaluation rather than training, supporting stateful multi-step professional tasks with realistic action spaces across 100 scenarios and 65 specialized domains.
The paper clearly defines the LES approach in p_12-p_17 and contrasts it with prior work in p_11, noting that approaches like WebWorld use world models for 'training' while LES is for 'evaluation.' The scope claim (100 scenarios, 65 domains) is suppo

Verified (95%) We define a Language Environment Simulator (LES) as a function: where c = (system prompt, tool schema, initial state, state description) is the environment configuration, s_t is the latent environment state maintained implicitly by the LLM through its context window, a_t is the agent's action (a tool call with name and arguments), and o_t+1 is the observation returned to the agent (a structured JSON tool response).
The paper provides a formal definition in p_12-p_13 with explicit specification of all components: c = (system prompt, tool schema, initial state, state description), s_t, a_t, and o_t+1. This is a complete methodological specification.
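
The components named in this definition can be sketched as a small interface. This is only an illustration under stated assumptions: the function signature itself is not reproduced in this analysis, so the `step` form below (configuration and interaction history in, observation out) is assumed, and `EnvConfig`, `ToolCall`, `LanguageEnvSimulator`, and `echo_backend` are hypothetical names. A real LES would call an LLM where the stub backend is used here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class EnvConfig:
    """c = (system prompt, tool schema, initial state, state description)."""
    system_prompt: str
    tool_schema: list[dict[str, Any]]
    initial_state: dict[str, Any]
    state_description: str


@dataclass(frozen=True)
class ToolCall:
    """a_t: a tool call with a name and arguments."""
    name: str
    arguments: dict[str, Any]


Observation = dict[str, Any]          # o_t+1: a structured JSON tool response
History = list[tuple[ToolCall, Observation]]
# Hypothetical backend signature; in practice this would wrap an LLM call.
Backend = Callable[[EnvConfig, History, ToolCall], Observation]


@dataclass
class LanguageEnvSimulator:
    config: EnvConfig
    backend: Backend
    # s_t is never stored explicitly: it is implied by the accumulated
    # history, mirroring state kept in the LLM's context window.
    history: History = field(default_factory=list)

    def step(self, action: ToolCall) -> Observation:
        observation = self.backend(self.config, list(self.history), action)
        self.history.append((action, observation))
        return observation


def echo_backend(config: EnvConfig, history: History, action: ToolCall) -> Observation:
    """Stand-in for the LLM: reports which tool was called and at which step."""
    return {"tool": action.name, "step": len(history) + 1, "status": "ok"}
```

With the stub backend, `LanguageEnvSimulator(config, echo_backend).step(ToolCall("lookup", {}))` returns a JSON-like dictionary and appends the exchange to `history`, which is the only place state accumulates.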

Unverifiable (85%) Each LES environment is fully specified by four components: System Prompt, Tool Schema, State Description.
The claim states 'four components' but only lists 3 items (System Prompt, Tool Schema, State Description), missing 'initial state.' The paper in p_13 explicitly lists 4 components: 'c = (system prompt, tool schema, initial state, state description).'

Verified (90%) Each environment contains 2-10 tools (median 5) reflecting realistic operational interfaces.
The paper states in p_16 that 'Each environment contains 2-10 tools (median 5).' This is corroborated by p_23 which reports 'The final dataset averages 5.5 tools' - consistent with a median of 5.

Verified (95%) Each evaluation instance must satisfy four conditions: (1) solvable: a valid solution exists and is verified; (2) verifiable: clear, automated success criteria; (3) discriminative: calibrated difficulty that distinguishes agent capabilities; and (4) diverse: structural variation across instances.
The paper explicitly states all four conditions in p_18 with clear definitions for each: solvable, verifiable, discriminative, and diverse. This is a complete methodological specification.

Verified (95%) We design 16 non-overlapping sub-topics per scenario and construct a professional reference document for each, covering domain terminology, workflows, state variables, edge cases, and constraints.
The paper states in p_19: 'we design 16 non-overlapping sub-topics per scenario and construct a professional reference document for each, covering domain terminology, workflows, state variables, edge cases, and constraints.' This is a clear methodolo

Verified (95%) We employ a multi-agent synthesis pipeline powered by Gemini-3-Flash-Preview as the Language Environment Simulator.
The paper states in p_20: 'We employ a multi-agent synthesis pipeline powered by Gemini-3-Flash-Preview as the Language Environment Simulator.' This is a clear specification of the synthesis approach.

Verified (95%) Tasks that are trivially easy (100% autonomous success), unsolvable (0% success), or have invalid tool schemas are filtered out.
The paper states in p_21: 'Tasks that are trivially easy (100% autonomous success), unsolvable (0% success), or have invalid tool schemas are filtered out.' This is a clear quality control specification.
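
The filtering rule quoted above amounts to a simple predicate. The sketch below assumes each candidate task carries a measured autonomous success rate and a schema-validity flag; the `Candidate` record and its field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    task_id: str
    success_rate: float  # fraction of autonomous rollouts that succeeded, in [0, 1]
    schema_valid: bool   # whether the tool schema passed validation


def keep(task: Candidate) -> bool:
    """Keep a task only if its schema is valid and it is neither trivial nor unsolvable."""
    if not task.schema_valid:
        return False
    # 1.0 means trivially easy, 0.0 means unsolvable; both are dropped.
    return 0.0 < task.success_rate < 1.0


def filter_tasks(tasks: list[Candidate]) -> list[Candidate]:
    return [task for task in tasks if keep(task)]
```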

Verified (85%) OCCUBENCH covers 100 professional task scenarios across 10 industry categories and 65 specialized domains. Each scenario maps to a real human job role, ensuring evaluation results have direct practical relevance.
The paper states the coverage numbers in p_22 and confirms that 'Each scenario maps to a real human job role.' Table 1 is referenced but not shown in the provided text, so the detailed breakdown cannot be verified. The 382 instances spanning all 100

Verified (85%) Our evaluation set contains 382 solvable task instances spanning all 100 scenarios. The final dataset averages 5.5 tools and 16.2 tool calls per task.
The paper provides specific quantitative metrics in p_23: '382 solvable task instances spanning all 100 scenarios' and 'averages 5.5 tools and 16.2 tool calls per task.' These are self-reported results without external validation.

Verified (95%) For each task, we select the difficulty level with the lowest autonomous success rate to maximize discriminative power.
The paper states in p_23: 'For each task, we select the difficulty level with the lowest autonomous success rate to maximize discriminative power.' This is a clear methodological specification.
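
As a minimal sketch of this selection rule, assume each task has a mapping from difficulty level to measured autonomous success rate. The level names and the `pick_difficulty` helper are hypothetical. How ties are broken, and how this step interacts with the removal of 0%-success tasks, is not stated in the analyzed text, so the sketch simply takes the minimum.

```python
def pick_difficulty(rates: dict[str, float]) -> str:
    """Return the difficulty level with the lowest autonomous success rate.

    On ties, the level that appears first in the mapping wins (an assumption).
    """
    if not rates:
        raise ValueError("at least one difficulty level is required")
    return min(rates, key=lambda level: rates[level])
```

For example, `pick_difficulty({"easy": 0.8, "medium": 0.5, "hard": 0.25})` selects `"hard"`.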

Verified (95%) All data is synthesized in clean environments (E0); faults are injected by appending fault rules to the LES's system prompt during evaluation.
The paper states in p_27: 'All data is synthesized in clean environments (E0); faults are injected by appending fault rules to the LES's system prompt during evaluation.' This is a clear specification of the fault injection methodology.

Verified (95%) All faults are transient (retrying recovers normal results), spaced across the interaction (not concentrated at the start), and parameterized by two independent controls: fault count (number of fault events, default 2) and fault duration (consecutive tool calls affected per event, default 2).
The paper states in p_32: 'All faults are transient (retrying recovers normal results), spaced across the interaction (not concentrated at the start), and parameterized by two independent controls: fault count (number of fault events, default 2) and
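
To make the two controls concrete, the sketch below lays out which tool-call indices would be affected for a given fault count and fault duration. It is an illustration only: according to the claim above, the benchmark injects faults through rules appended to the simulator's system prompt, whereas this hypothetical `fault_schedule` helper computes an explicit, evenly spaced schedule, and the even spacing is an assumption.

```python
def fault_schedule(total_calls: int, fault_count: int = 2, fault_duration: int = 2) -> list[range]:
    """Return the 0-based tool-call index ranges affected by each fault event.

    Events are separated by clean calls and never start at call 0, so faults
    are spread across the interaction rather than concentrated at the start.
    """
    if fault_count < 0 or fault_duration < 1:
        raise ValueError("fault_count must be >= 0 and fault_duration >= 1")
    clean_calls = total_calls - fault_count * fault_duration
    if clean_calls < fault_count:
        raise ValueError("interaction too short for the requested faults")

    # Spread the clean calls over fault_count + 1 gaps; at least one clean
    # call must precede every event.
    gap = max(1, clean_calls // (fault_count + 1))
    events: list[range] = []
    cursor = 0
    for _ in range(fault_count):
        start = cursor + gap
        events.append(range(start, start + fault_duration))
        cursor = start + fault_duration
    return events


def is_faulty(call_index: int, events: list[range]) -> bool:
    """Calls outside every event behave normally, which models transient faults."""
    return any(call_index in event for event in events)
```

With the defaults and a 16-call interaction (an illustrative length), calls 4-5 and 10-11 are affected and every other call returns normal results.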

Verified (95%) We evaluate 15 frontier models spanning 8 model families: OpenAI (GPT-5.2), Anthropic (Claude Opus/Sonnet 4, 4.5, 4.6), Google (Gemini 3.1 Pro, Flash-Lite), DeepSeek (V3.2), Moonshot (Kimi K2.5), MiniMax (M2.7), Zhipu (GLM-5), and Alibaba (Qwen 3.5 Plus, Flash).
The paper lists all 15 models across 8 families in p_35 with specific model names and citations. This is a complete specification of the evaluation scope.

Verified (85%) Implicit faults are harder than explicit faults: Average performance under E2 (implicit, 53.4%) drops far more than E1 (explicit, 62.6%) from E0 (67.5%).
The paper provides specific quantitative results in p_3: E0 (67.5%), E1 explicit faults (62.6%), E2 implicit faults (53.4%). The drop from E0 to E2 (14.1 points) is indeed larger than E0 to E1 (4.9 points). Self-reported results.

Verified (85%) Claude Opus 4.6 drops from 71.5% (fc=1) to 60.2% (fc=4), and from 67.8% (fd=1) to 57.9% (fd=4). Qwen 3.5 Plus degrades from 61.3% to 49.7% (count) and 59.7% to 49.2% (duration).
The paper provides specific quantitative results in p_37 for fault parameter ablation: Claude Opus 4.6 drops from 71.5% (fc=1) to 60.2% (fc=4), and from 67.8% (fd=1) to 57.9% (fd=4). Qwen 3.5 Plus degrades from 61.3% to 49.7% (count) and 59.7% to 49.

Verified (85%) Larger models consistently outperform smaller counterparts, with gaps of 11.0% (Gemini Pro vs. Flash-Lite), 10.2% (Qwen Plus vs. Flash), and 7.1% (Claude Opus vs. Sonnet 4.6).
The paper provides specific quantitative results in p_39: gaps of 11.0% (Gemini Pro vs. Flash-Lite), 10.2% (Qwen Plus vs. Flash), and 7.1% (Claude Opus vs. Sonnet 4.6). Self-reported results.

Verified (85%) Claude Opus shows consistent generational improvement: 61.3% → 65.2% → 71.5% (+10.2% total).
The paper provides specific quantitative results in p_43: Claude Opus shows 61.3% → 65.2% → 71.5% (+10.2% total) across three generations. Self-reported results.

Verified (85%) GPT-5.2 exhibits a clear monotonic trend: scaling from none (54.7%) to xhigh (82.2%), a 27.5-point improvement, demonstrating that deeper reasoning directly translates to better task execution.
The paper provides specific quantitative results in p_45: GPT-5.2 scales from none (54.7%) to xhigh (82.2%), a 27.5-point improvement. Self-reported results.
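
The percentage-point differences quoted in the claims above can be recomputed directly from the reported figures. The short script below uses only numbers that appear in this analysis; the labels and the `point_change` helper are illustrative.

```python
# (before, after) task success rates in percent, as quoted in the claims above.
REPORTED = {
    "E0 -> E1 (explicit faults)": (67.5, 62.6),
    "E0 -> E2 (implicit faults)": (67.5, 53.4),
    "Claude Opus 4.6, fc=1 -> fc=4": (71.5, 60.2),
    "Claude Opus 4.6, fd=1 -> fd=4": (67.8, 57.9),
    "Qwen 3.5 Plus, fault count": (61.3, 49.7),
    "Qwen 3.5 Plus, fault duration": (59.7, 49.2),
    "Claude Opus across generations": (61.3, 71.5),
    "GPT-5.2, none -> xhigh": (54.7, 82.2),
}


def point_change(before: float, after: float) -> float:
    """Change in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


for label, (before, after) in REPORTED.items():
    print(f"{label}: {point_change(before, after):+.1f} points")
```

The recomputed values agree with the differences stated in the text: 4.9 and 14.1 points for explicit and implicit faults, +10.2 points across Claude Opus generations, and +27.5 points for GPT-5.2.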

... 37 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

The complete OCCUBENCH benchmark dataset (100 scenarios across 65 domains) is not publicly available
Language Environment Simulator implementation details - how tool responses are simulated, architecture, prompts used
Exact prompts and prompt templates used for each model evaluation
Evaluation criteria and scoring methodology - how success/failure is determined for each scenario
Model hyperparameters (temperature, top-p, max tokens, etc.) for all 15 models tested
API versions and access dates for each model
Number of evaluation runs per scenario and whether results are averaged
Statistical significance measures (confidence intervals, standard errors)
Ground truth answers/expected outputs for the 100 scenarios
Hardware and computational environment specifications

Limitations (as stated by the authors)

Language Environment Simulators model domain logic rather than domain data. An LES understands that a drug interaction check should return contraindications, but the specific values it returns are generated rather than retrieved from a real database. This means OCCUBENCH evaluates an agent's decision-making process (whether it checks the right things in the right order) rather than its ability to handle exact real-world data values.
For domains where precise numerical correctness is critical (e.g., financial calculations to the cent), LES-based evaluation should be complemented with real-environment testing.
As our cross-simulator experiments demonstrate, evaluation results are tied to the specific simulator used during data synthesis. Tasks verified as solvable under Gemini-3-Flash may become unsolvable under a different LES, and agent rankings can shift when the simulator changes. This is an inherent limitation of any LES-based evaluation: the simulator is part of the evaluation apparatus, not a neutral observer.

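This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.

Analysis time: 2026-04-23T13:35:16+00:00 · Data source: Paper Collector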