OCCUBENCH benchmarks AI agents on 100 real-world professional tasks using Language Environment Simulators. Testing 15 frontier models reveals implicit faults are harder than explicit faults, scaling improves performance, and strong agents aren't necessarily strong simulators.
Core Question
How can AI agents be systematically evaluated on real-world professional tasks across diverse occupational domains where no existing benchmarks exist?
Core Method
Approach: The authors introduce Language Environment Simulators (LES) that use LLMs to simulate tool-response-level environment interactions for evaluation. The benchmark covers 100 professional task scenarios across 10 industries and 65 specialized domains, with 382 solvable task instances evaluated under clean conditions and three fault injection scenarios (explicit, implicit, and mixed faults).
Key components:
- LLMs can serve as world models of the internet for web agent planning, as demonstrated by prior work.
- WebWorld trains the first open-web simulator at scale on 1M+ interactions for agent training and inference-time search.
- The LES approach uses LLMs to simulate tool-response-level environment interaction specifically for evaluation purposes.
- LES supports stateful multi-step professional tasks with realistic action spaces across 100 scenarios and 65 specialized domains.
- Larger models consistently outperform smaller counterparts within the same family.
- Performance gaps range from 7.1% to 11.0% between large and small variants.
- Claude 4.5 is an exception where Opus and Sonnet perform nearly identically.
- The 4.5 generation's architectural improvements benefited both model sizes equally.
- Each model exhibits a unique pattern of strengths and weaknesses across different industries.
- Gemini 3.1 Pro excels in knowledge-intensive domains that reward factual accuracy and structured reasoning.
Source sections: sec_3, sec_21, sec_27
Claim Verification
The paper provides concrete evidence of OCCUBENCH's scope: p_22 states 'OCCUBENCH covers 100 professional task scenarios across 10 industry categories and 65 specialized domains' and p_23 confirms '382 solvable task instances spanning all 100 scenari
The paper clearly defines the LES approach in p_12-p_17 and contrasts it with prior work in p_11, noting that approaches like WebWorld use world models for 'training' while LES is for 'evaluation.' The scope claim (100 scenarios, 65 domains) is suppo
The paper provides a formal definition in p_12-p_13 with explicit specification of all components: c = (system prompt, tool schema, initial state, state description), s_t, a_t, and o_t+1. This is a complete methodological specification.
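The formal definition above can be pictured as a small data structure plus a step loop. This is a minimal sketch, not the authors' implementation: the names `EnvConfig`, `rollout`, and `toy_simulator` are hypothetical, and it assumes the simulator maps (c, s_t, a_t) to (s_t+1, o_t+1), which this summary does not spell out. A deterministic stub stands in for the LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class EnvConfig:
    """c = (system prompt, tool schema, initial state, state description)."""
    system_prompt: str
    tool_schema: List[Dict[str, Any]]
    initial_state: Dict[str, Any]
    state_description: str


# Assumed signature: (c, s_t, a_t) -> (s_{t+1}, o_{t+1}).
Simulator = Callable[
    [EnvConfig, Dict[str, Any], Dict[str, Any]],
    Tuple[Dict[str, Any], str],
]


def rollout(
    config: EnvConfig,
    simulator: Simulator,
    actions: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Apply each action a_t in order, collecting observations o_{t+1}."""
    state = dict(config.initial_state)  # copy so the config is not mutated
    observations = []
    for action in actions:
        state, observation = simulator(config, state, action)
        observations.append(observation)
    return state, observations


def toy_simulator(config, state, action):
    """Stand-in for the LLM: counts tool calls and echoes the tool name."""
    new_state = {**state, "calls": state.get("calls", 0) + 1}
    return new_state, "ok: " + action["tool"]


cfg = EnvConfig(
    system_prompt="You simulate a pharmacy back end.",  # illustrative text
    tool_schema=[{"name": "check_interaction"}],
    initial_state={"calls": 0},
    state_description="calls: number of tool calls so far",
)
final_state, obs = rollout(
    cfg,
    toy_simulator,
    [{"tool": "check_interaction"}, {"tool": "check_interaction"}],
)
```

In a real LES the stub would be replaced by a model call that receives the system prompt, tool schema, and current state description.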
The claim states 'four components' but only lists 3 items (System Prompt, Tool Schema, State Description), missing 'initial state.' The paper in p_13 explicitly lists 4 components: 'c = (system prompt, tool schema, initial state, state description).'
The paper states in p_16 that 'Each environment contains 2-10 tools (median 5).' This is corroborated by p_23 which reports 'The final dataset averages 5.5 tools' - consistent with a median of 5.
The paper explicitly states all four conditions in p_18 with clear definitions for each: solvable, verifiable, discriminative, and diverse. This is a complete methodological specification.
The paper states in p_19: 'we design 16 non-overlapping sub-topics per scenario and construct a professional reference document for each, covering domain terminology, workflows, state variables, edge cases, and constraints.' This is a clear methodolo
The paper states in p_20: 'We employ a multi-agent synthesis pipeline powered by Gemini-3-Flash-Preview as the Language Environment Simulator.' This is a clear specification of the synthesis approach.
The paper states in p_21: 'Tasks that are trivially easy (100% autonomous success), unsolvable (0% success), or have invalid tool schemas are filtered out.' This is a clear quality control specification.
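The quality filter quoted above reduces to a simple predicate. The sketch below assumes each candidate task carries a measured autonomous success rate in [0, 1] and a schema-validity flag; the function name `keep_task` and both parameters are hypothetical.

```python
def keep_task(success_rate: float, schema_valid: bool) -> bool:
    """Keep a task only if its schema is valid and it is neither
    trivially easy (100% success) nor unsolvable (0% success)."""
    return schema_valid and 0.0 < success_rate < 1.0


candidates = [
    {"id": "a", "success_rate": 1.0, "schema_valid": True},   # trivially easy
    {"id": "b", "success_rate": 0.0, "schema_valid": True},   # unsolvable
    {"id": "c", "success_rate": 0.4, "schema_valid": False},  # invalid schema
    {"id": "d", "success_rate": 0.4, "schema_valid": True},   # kept
]
kept_ids = [
    t["id"] for t in candidates
    if keep_task(t["success_rate"], t["schema_valid"])
]
```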
The paper states the coverage numbers in p_22 and confirms that 'Each scenario maps to a real human job role.' Table 1 is referenced but not shown in the provided text, so the detailed breakdown cannot be verified. The 382 instances spanning all 100
The paper provides specific quantitative metrics in p_23: '382 solvable task instances spanning all 100 scenarios' and 'averages 5.5 tools and 16.2 tool calls per task.' These are self-reported results without external validation.
The paper states in p_23: 'For each task, we select the difficulty level with the lowest autonomous success rate to maximize discriminative power.' This is a clear methodological specification.
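The selection rule quoted above amounts to taking a minimum over difficulty levels. In this sketch the level names and rates are made up, `pick_hardest_level` is a hypothetical helper, and it is assumed that levels with 0% success were already removed by the earlier filter.

```python
from typing import Dict


def pick_hardest_level(success_by_level: Dict[str, float]) -> str:
    """Return the difficulty level with the lowest autonomous success rate."""
    if not success_by_level:
        raise ValueError("no difficulty levels to choose from")
    return min(success_by_level, key=success_by_level.get)


# Illustrative rates only; not taken from the paper.
rates = {"easy": 0.9, "medium": 0.6, "hard": 0.3}
chosen = pick_hardest_level(rates)
```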
The paper states in p_27: 'All data is synthesized in clean environments (E0); faults are injected by appending fault rules to the LES's system prompt during evaluation.' This is a clear specification of the fault injection methodology.
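Appending fault rules to the simulator's system prompt could look roughly like the following. The helper `inject_faults` and the rule wording are invented for illustration; the paper's actual fault prompts are not reproduced in this summary.

```python
from typing import List


def inject_faults(system_prompt: str, fault_rules: List[str]) -> str:
    """Return the clean (E0) prompt with fault rules appended.

    With no rules, the clean prompt is returned unchanged.
    """
    if not fault_rules:
        return system_prompt
    rule_lines = "\n".join("- " + rule for rule in fault_rules)
    return system_prompt + "\n\nFault rules:\n" + rule_lines


clean_prompt = "You simulate the tools of a logistics system."  # illustrative
faulty_prompt = inject_faults(
    clean_prompt,
    ["On one tool call, return a timeout error; a retry succeeds."],  # hypothetical rule
)
```

Because the clean prompt is left intact, the same synthesized task can be evaluated with or without faults without regenerating any data.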
The paper states in p_32: 'All faults are transient (retrying recovers normal results), spaced across the interaction (not concentrated at the start), and parameterized by two independent controls: fault count (number of fault events, default 2) and
The paper lists all 15 models across 8 families in p_35 with specific model names and citations. This is a complete specification of the evaluation scope.
The paper provides specific quantitative results in p_3: E0 (67.5%), E1 explicit faults (62.6%), E2 implicit faults (53.4%). The drop from E0 to E2 (14.1 points) is indeed larger than E0 to E1 (4.9 points). Self-reported results.
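The point differences cited above can be rechecked directly from the three reported rates. The variable names are illustrative; the numbers are the ones quoted from the paper.

```python
# Reported average success rates (percent) per condition.
rates = {"E0": 67.5, "E1": 62.6, "E2": 53.4}

# Rounding to one decimal removes floating-point noise.
drop_explicit = round(rates["E0"] - rates["E1"], 1)  # clean vs. explicit faults
drop_implicit = round(rates["E0"] - rates["E2"], 1)  # clean vs. implicit faults

print(drop_explicit, drop_implicit)  # 4.9 14.1
```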
The paper provides specific quantitative results in p_37 for fault parameter ablation: Claude Opus 4.6 drops from 71.5% (fc=1) to 60.2% (fc=4), and from 67.8% (fd=1) to 57.9% (fd=4). Qwen 3.5 Plus degrades from 61.3% to 49.7% (count) and 59.7% to 49.
The paper provides specific quantitative results in p_39: gaps of 11.0% (Gemini Pro vs. Flash-Lite), 10.2% (Qwen Plus vs. Flash), and 7.1% (Claude Opus vs. Sonnet 4.6). Self-reported results.
The paper provides specific quantitative results in p_43: Claude Opus shows 61.3% → 65.2% → 71.5% (+10.2% total) across three generations. Self-reported results.
The paper provides specific quantitative results in p_45: GPT-5.2 scales from none (54.7%) to xhigh (82.2%), a 27.5-point improvement. Self-reported results.
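The generation-over-generation and reasoning-effort gains in the two results above follow from simple differences. This sketch uses a hypothetical helper, `stepwise_gains`, on the quoted figures.

```python
from typing import List


def stepwise_gains(scores: List[float]) -> List[float]:
    """Differences between consecutive scores, rounded to one decimal."""
    return [round(b - a, 1) for a, b in zip(scores, scores[1:])]


claude_opus = [61.3, 65.2, 71.5]   # three generations, as quoted
gpt_effort = [54.7, 82.2]          # reasoning effort: none, xhigh

opus_steps = stepwise_gains(claude_opus)
opus_total = round(claude_opus[-1] - claude_opus[0], 1)
gpt_total = stepwise_gains(gpt_effort)[0]

print(opus_steps, opus_total, gpt_total)  # [3.9, 6.3] 10.2 27.5
```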
... 37 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- The complete OCCUBENCH benchmark dataset (100 scenarios across 65 domains) is not publicly available
- Language Environment Simulator implementation details - how tool responses are simulated, architecture, prompts used
- Exact prompts and prompt templates used for each model evaluation
- Evaluation criteria and scoring methodology - how success/failure is determined for each scenario
- Model hyperparameters (temperature, top-p, max tokens, etc.) for all 15 models tested
- API versions and access dates for each model
- Number of evaluation runs per scenario and whether results are averaged
- Statistical significance measures (confidence intervals, standard errors)
- Ground truth answers/expected outputs for the 100 scenarios
- Hardware and computational environment specifications
Limitations (as stated by the authors)
- Language Environment Simulators model domain logic rather than domain data. An LES understands that a drug interaction check should return contraindications, but the specific values it returns are generated rather than retrieved from a real database. This means OCCUBENCH evaluates an agent's decision-making process (whether it checks the right things in the right order) rather than its ability to handle exact real-world data values.
- For domains where precise numerical correctness is critical (e.g., financial calculations to the cent), LES-based evaluation should be complemented with real-environment testing.
- As our cross-simulator experiments demonstrate, evaluation results are tied to the specific simulator used during data synthesis. Tasks verified as solvable under Gemini-3-Flash may become unsolvable under a different LES, and agent rankings can shift when the simulator changes. This is an inherent limitation of any LES-based evaluation: the simulator is part of the evaluation apparatus, not a neutral observer.
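This analysis was generated automatically by PDF Reading Assistant. It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-23T13:35:16+00:00 · Data source: Paper Collector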