GameWorld introduces a standardized benchmark for multimodal game agents with 34 browser games and 170 tasks, featuring state-verifiable evaluation. Testing 18 model-interface pairs reveals agents achieve partial progress (best PG=41.9) but remain far below human novice performance (55.3 SR/64.
Core Question
How can multimodal game agents be evaluated in a standardized, reproducible manner across diverse browser games, and what are the current capabilities and limitations of such agents?
Core Method
Approach: The authors created GameWorld, a benchmark with 34 browser games spanning 5 genres and 170 tasks, featuring two agent interfaces (Computer-Use and Generalist) that normalize outputs into a shared executable action space. A sandbox environment pauses game execution during model inference to decouple latency from gameplay, while outcome-based state-verifiable evaluation uses a JavaScript bridge exposing gameAPI state for deterministic metrics.
Key components:
- Malformed format can result from truncation of very long reasoning.
- Tool-call blocks may remain unclosed when output is truncated.
- Models may return plausible tool calls that are not registered in the game.
- Semantic actions like craft_a_workbench() may not be valid in the control space.
- CUA models may attempt mouse clicks in keyboard-only games.
- Actions outside the allowed control space result in invalid responses.
Section IDs: sec_58, sec_60, sec_61
Claim Verification
The paper provides concrete evidence: Table 4 (p_26-p_62) lists all 34 games with descriptions, p_3 explicitly states '34 browser games spanning five genres (Runner, Arcade, Platformer, Puzzle, and Simulation) with 170 diverse tasks.' The game invent
The paper provides detailed technical evidence: p_63 describes the JavaScript bridge, gameAPI state serialization, and evaluation process. p_68-69 provide formal metric definitions with mathematical formulas for progress (PG) and success rate (SR). T
GameWorld-RT is described in p_76 with concrete results reported in Table 8. The paper provides quantitative results comparing Qwen3-VL-30B-A3B and Qwen3-VL-235B-A22B in both Generalist and CUA interfaces under real-time conditions.
This is a comprehensive claim combining benchmark scope (supported by Table 4), dual interface support (p_11-12, Table 1), semantic action parsing (p_15), and sandbox design (p_24-25). All components are documented with technical details.
The paper provides detailed technical specification of the evaluator in p_63, including the JavaScript bridge injection, gameAPI state structure, and metric computation. The claim of 'noise-free' evaluation is supported by the deterministic state-bas
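The state-based evaluation idea described above can be illustrated with a minimal sketch. It assumes the bridge hands the evaluator a JSON snapshot of game state. The field name, the target threshold, and the clipped-ratio progress formula are all hypothetical placeholders, not the paper's metric definitions.

```python
import json
from typing import Tuple


def evaluate(state_json: str, field: str, target: float) -> Tuple[bool, float]:
    """Score one run from a serialized state snapshot (no screenshots, no model judge)."""
    state = json.loads(state_json)
    value = float(state.get(field, 0.0))
    # Assumed normalization: ratio to the target, clipped to [0, 1]. Requires target > 0.
    progress = max(0.0, min(value / target, 1.0))
    return value >= target, progress
```

Because the score depends only on the snapshot, re-evaluating the same snapshot always yields the same result, which is the property the summary refers to as deterministic.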
The paper provides evidence for each analysis component: repeated-evaluation robustness (p_73, Figure 4), curriculum analyses (p_74), GameWorld-RT (p_76-77), context-memory sensitivity (p_90-91, Table 9), and action-validity diagnostics (p_92-96, Tab
The sandbox design is clearly documented in p_24-25 with explicit description of the pause mechanism during model inference. The technical implementation is detailed in Appendix C (p_120-131) including browser manager and readiness gate.
The two agent interfaces are clearly defined in p_3, p_11-12, and Table 1. Computer-Use Agents emit raw keyboard/mouse controls (p_13), while Generalist Multimodal Agents use semantic actions parsed through deterministic Semantic Action Parsing (p_15
The unified control space is explicitly defined in p_11 and Table 1 with the seven atomic events listed: mouse_move, mouse_down, mouse_up, key_down, key_up, scroll, and wait. The normalization process is further detailed in p_136-140.
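As a rough illustration of what a shared control space with these seven events might look like in code, the sketch below types the event names and normalizes a loosely formatted model output. The event names come from the summary; the `params` layout and the `normalize` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class EventType(str, Enum):
    """The seven atomic events named in the summary above."""
    MOUSE_MOVE = "mouse_move"
    MOUSE_DOWN = "mouse_down"
    MOUSE_UP = "mouse_up"
    KEY_DOWN = "key_down"
    KEY_UP = "key_up"
    SCROLL = "scroll"
    WAIT = "wait"


@dataclass
class AtomicEvent:
    type: EventType
    # Parameter names (x, y, key, ...) are hypothetical, not taken from the paper.
    params: Dict[str, Any] = field(default_factory=dict)


def normalize(raw: Dict[str, Any]) -> AtomicEvent:
    """Turn a loosely formatted model output into a typed atomic event."""
    name = str(raw.get("type", "")).strip().lower()
    event_type = EventType(name)  # raises ValueError for names outside the seven events
    params = {k: v for k, v in raw.items() if k != "type"}
    return AtomicEvent(event_type, params)
```

For example, `normalize({"type": " Key_Down ", "key": "ArrowUp"})` yields a `KEY_DOWN` event, while an unknown name such as `craft_a_workbench` raises `ValueError`.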
This is a restatement of the interface distinction documented in p_12, with Computer-Use Agents described in p_13-14 and Generalist Multimodal Agents in p_15. The design is clearly specified with implementation details.
The one-action-per-step constraint is explicitly stated in p_14 for CUAs and reinforced in p_15 for Generalist agents (Action Atomicity). This design choice is clearly documented as an evaluation consistency measure.
The rejection of out-of-scope actions is documented in p_14 and detailed in p_141 (legality checker). The paper describes how invalid actions are logged and ignored at execution time.
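A legality check of this kind could be sketched as follows. The per-game allow-list `KEYBOARD_ONLY` and the function signature are hypothetical; the only behavior taken from the summary is that out-of-scope actions are logged and not executed.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical control space for a keyboard-only game; the real allow-lists are not given here.
KEYBOARD_ONLY: FrozenSet[str] = frozenset({"key_down", "key_up", "wait"})


def check_action(action: Dict[str, str], allowed: FrozenSet[str],
                 invalid_log: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the action if it lies inside the allowed control space, else log it and return None."""
    if action.get("type") in allowed:
        return action
    invalid_log.append(action)  # kept for later action-validity diagnostics
    return None                 # the caller skips execution for this step
```

With this allow-list, a mouse click emitted by a CUA model in a keyboard-only game ends up in `invalid_log` and never reaches the game.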
Semantic Action Parsing is introduced in p_15 with description of the deterministic parser mapping semantic actions to low-level commands. Further implementation details are in p_142 describing the resolution process.
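One way to picture a deterministic parser of this sort is a fixed lookup table from semantic names to low-level events. The action names and key bindings below are illustrative assumptions, not the paper's registry.

```python
from typing import Dict, List, Tuple

# Hypothetical registry for one game: semantic name -> fixed sequence of (event, key).
SEMANTIC_ACTIONS: Dict[str, List[Tuple[str, str]]] = {
    "jump": [("key_down", "Space"), ("key_up", "Space")],
    "move_left": [("key_down", "ArrowLeft"), ("key_up", "ArrowLeft")],
}


def parse_semantic_action(name: str) -> List[Tuple[str, str]]:
    """Resolve a semantic action to low-level events; unregistered names raise KeyError."""
    if name not in SEMANTIC_ACTIONS:
        raise KeyError(f"unregistered semantic action: {name}")
    return list(SEMANTIC_ACTIONS[name])  # copy so callers cannot mutate the registry
```

A table lookup has no randomness and no model in the loop, so the same semantic action always resolves to the same low-level commands.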
Action Atomicity is explicitly enforced in p_15, stating 'each model response must specify one interaction command per step' and explicitly disallowing 'multi-command macro that bundles several semantically distinct decisions.'
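The one-command-per-step rule can be expressed as a small guard on the parsed model response. This is a sketch only: the list-of-dicts input shape is assumed, and raising an exception is one possible way to handle a violation, not necessarily what the benchmark does.

```python
from typing import Any, Dict, List


def enforce_atomicity(tool_calls: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Accept a response only if it carries exactly one interaction command."""
    if len(tool_calls) != 1:
        raise ValueError(f"expected exactly 1 command per step, got {len(tool_calls)}")
    return tool_calls[0]
```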
The shared agent harness is described in p_16, standardizing prompts, memory, and tool interfaces across models. Appendix D is referenced for implementation details.
The fixed prompt template with four components is specified in p_17, with the structure explicitly listed in p_144-146. The paper states the template structure stays constant with only game-specific content swapped.
The rolling memory module is described in p_18 with the sequence 'user_prompt → screenshot → reasoning → action' and Action History block. Further details in p_135 describe the memory store implementation.
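A rolling memory along these lines might be sketched with a bounded queue. The turn fields mirror the sequence quoted above; the window size, class names, and the text format of the Action History block are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class Turn:
    user_prompt: str
    screenshot: bytes
    reasoning: str
    action: str


class RollingMemory:
    def __init__(self, max_turns: int) -> None:
        # deque(maxlen=...) silently drops the oldest turn once the window is full.
        self._turns: Deque[Turn] = deque(maxlen=max_turns)

    def add(self, turn: Turn) -> None:
        self._turns.append(turn)

    def action_history(self) -> str:
        """Render the retained actions as a plain-text block for the next prompt."""
        lines = [f"{i + 1}. {t.action}" for i, t in enumerate(self._turns)]
        return "Action History:\n" + "\n".join(lines)
```

With `max_turns=2`, adding the actions `jump`, `move_left`, and `wait` leaves only the last two in the rendered history.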
The tool registration using native function-calling interfaces is described in p_20, with examples given for OpenAI, Claude, and Gemini. This design choice leverages each model's native capabilities.
The pause mechanism is clearly described in p_24 as the central design goal of the sandbox, ensuring 'every agent faces identical game dynamics regardless of response latency.'
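The effect of pausing during inference can be shown with a toy simulation. `PausableGame` and `run_episode` are hypothetical stand-ins, not the paper's sandbox: the point is only that game time advances while an action executes and never while the model is thinking.

```python
from typing import Callable, List


class PausableGame:
    """Toy stand-in for a browser game whose clock only advances while unpaused."""

    def __init__(self) -> None:
        self.ticks = 0
        self.paused = False

    def advance(self, n: int) -> None:
        if not self.paused:
            self.ticks += n


def run_episode(game: PausableGame, policy: Callable[[int], str],
                steps: int, ticks_per_action: int) -> List[str]:
    actions = []
    for _ in range(steps):
        game.paused = True                  # freeze the world before querying the model
        actions.append(policy(game.ticks))  # inference may take arbitrarily long here
        game.paused = False                 # resume only to execute the chosen action
        game.advance(ticks_per_action)
    return actions
```

If a slow policy tries to let time pass during inference (for instance by calling `game.advance(1000)` inside the policy), the call has no effect while paused, so fast and slow models observe the same number of game ticks per step.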
Configurable game speed and deterministic seed settings are mentioned in p_25 for evaluation reproducibility. The implementation includes 'deterministic-randomness script by overriding JavaScript headers' (p_120).
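The general idea of seed-controlled randomness can be sketched as dependency injection of a seeded generator. The summary describes the paper doing this inside the browser with an injected script; the Python analogue below, including `make_seeded_random` and the toy `spawn_positions` logic, is an assumption-laden illustration rather than the authors' implementation.

```python
import random
from typing import Callable, List


def make_seeded_random(seed: int) -> Callable[[], float]:
    """Return a drop-in replacement for an unseeded random() source."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return rng.random


def spawn_positions(rand: Callable[[], float], n: int, width: int) -> List[int]:
    # Toy game logic: all randomness flows through the injected `rand`.
    return [int(rand() * width) for _ in range(n)]
```

Two runs built with the same seed produce identical spawn positions, which is what makes repeated evaluations comparable.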
... 46 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Code availability - no code repository or implementation details provided
- Data availability - GameWorld benchmark and game environments not publicly accessible
- Model hyperparameters - temperature, top_p, max_tokens, and other API parameters for proprietary models not specified
- Random seeds - no seed information for reproducible evaluation runs
- Hardware specifications - GPU/compute resources used for running models not detailed
- Exact execution duration - only stated as 'usually 200-500 ms' without precise values
- Number of evaluation episodes/runs per game
- Game environment configurations and specific game titles used
- Prompts and instructions given to each model
- Verifier implementation details and scoring criteria
Limitations (as stated by the authors)
- The benchmark necessitates designing unique instruction sets for each new environment, which tightly couples the action space to the task and constrains the model's scalability.
- Automating the production and alignment of Semantic Action Parsing through MLLM-powered agent exploration is left for future work.
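This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic peer review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-19T07:31:07+00:00 · Data source: Paper Collector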