GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents - AI Paper In-Depth Analysis

TL;DR
GameWorld introduces a standardized benchmark for multimodal game agents with 34 browser games and 170 tasks, featuring state-verifiable evaluation. Testing 18 model-interface pairs reveals agents achieve partial progress (best PG=41.9) but remain far below human novice performance (55.3 SR/64.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

84%

Core Question

How can multimodal game agents be evaluated in a standardized, reproducible manner across diverse browser games, and what are the current capabilities and limitations of such agents?

Core Method

Approach: The authors created GameWorld, a benchmark with 34 browser games spanning 5 genres and 170 tasks, featuring two agent interfaces (Computer-Use and Generalist) that normalize outputs into a shared executable action space. A sandbox environment pauses game execution during model inference to decouple latency from gameplay, while outcome-based state-verifiable evaluation uses a JavaScript bridge exposing gameAPI state for deterministic metrics.

Key components:
Malformed format can result from truncation of very long reasoning.
Tool-call blocks may remain unclosed when output is truncated.
Models may return plausible tool calls that are not registered in the game.
Semantic actions like craft_a_workbench() may not be valid in the control space.
CUA models may attempt mouse clicks in keyboard-only games.
Actions outside the allowed control space result in invalid responses.

Section IDs: sec_58, sec_60, sec_61

Claim Verification

Verified (95%) we introduce GameWorld, a standardized benchmark for multimodal game agents in browser environments. GameWorld comprises 34 browser games spanning five genres (Runner, Arcade, Platformer, Puzzle, and Simulation) with 170 diverse tasks.
The paper provides concrete evidence: Table 4 (p_26-p_62) lists all 34 games with descriptions, p_3 explicitly states '34 browser games spanning five genres (Runner, Arcade, Platformer, Puzzle, and Simulation) with 170 diverse tasks.' The game invent

Verified (90%) Each task is paired with an outcome-based state-verifiable evaluator over serialized gameAPI state, producing deterministic progress and success signals without perceptual noise.
The paper provides detailed technical evidence: p_63 describes the JavaScript bridge, gameAPI state serialization, and evaluation process. p_68-69 provide formal metric definitions with mathematical formulas for progress (PG) and success rate (SR). T
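
As a rough illustration of how deterministic progress and success signals could be derived from serialized state, the sketch below assumes a hypothetical task that tracks a single numeric state variable against a start and target value. The field names, the sample state, and the clipping formula are assumptions for illustration; they are not the paper's definitions of PG and SR.

```python
import json

def progress(state: dict, key: str, start: float, target: float) -> float:
    """Normalized progress in [0, 1] toward a target value (assumed formula)."""
    if target == start:
        raise ValueError("target must differ from start")
    raw = (float(state[key]) - start) / (target - start)
    return max(0.0, min(1.0, raw))  # clip so overshoot still counts as 1.0

def success(state: dict, key: str, target: float) -> bool:
    """Binary success: the tracked variable reached the target (assumed rule)."""
    return float(state[key]) >= target

# Hypothetical serialized game state, as a bridge might hand it to the evaluator.
serialized = '{"score": 30, "lives": 2}'
state = json.loads(serialized)

pg = progress(state, "score", start=0, target=120)
ok = success(state, "score", target=120)
```

Because both functions read only the serialized variables, repeated evaluation of the same state always yields the same result, which is the property the claim emphasizes.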

Verified (85%) We further establish GameWorld-RT, an unpaused real-time benchmark variant in which environment dynamics continue during inference, making response latency part of the task itself.
GameWorld-RT is described in p_76 with concrete results reported in Table 8. The paper provides quantitative results comparing Qwen3-VL-30B-A3B and Qwen3-VL-235B-A22B in both Generalist and CUA interfaces under real-time conditions.

Verified (90%) A standardized and comprehensive benchmark for multimodal game agents. GameWorld provides 34 browser games spanning 5 genres and 170 tasks. It supports both Computer-Use Agents and Generalist Multimodal Agents under a shared executable action space via deterministic Semantic Action Parsing, together with a sandbox that decouples inference latency from gameplay, enabling standardized evaluation across different control interfaces.
This is a comprehensive claim combining benchmark scope (supported by Table 4), dual interface support (p_11-12, Table 1), semantic action parsing (p_15), and sandbox design (p_24-25). All components are documented with technical details.

Verified (85%) A universal outcome-based state-verifiable evaluator. Unlike prior game benchmarks that rely on noisy visual heuristics or VLM-as-judge pipelines, GameWorld evaluates entirely through outcome-based metrics computed from serialized gameAPI state. We compute deterministic task success and normalized progress directly from task-relevant game variables, ensuring noise-free and fully reproducible evaluation.
The paper provides detailed technical specification of the evaluator in p_63, including the JavaScript bridge injection, gameAPI state structure, and metric computation. The claim of 'noise-free' evaluation is supported by the deterministic state-bas

Verified (85%) A suite of interface-aware benchmark analyses. GameWorld contributes repeated-evaluation robustness studies to characterize the reproducibility of the benchmark itself. It further provides capability-aligned curriculum analyses, the real-time benchmark variant GameWorld-RT, context-memory sensitivity analysis, and action-validity diagnostics to study latency coupling, capability bottlenecks, context-memory trade-offs, and instruction-following reliability across both game-agent interfaces.
The paper provides evidence for each analysis component: repeated-evaluation robustness (p_73, Figure 4), curriculum analyses (p_74), GameWorld-RT (p_76-77), context-memory sensitivity (p_90-91, Table 9), and action-validity diagnostics (p_92-96, Tab

Verified (90%) A browser-based sandbox pauses game execution during model inference, decoupling inference latency from gameplay so that scores reflect decision quality rather than response speed.
The sandbox design is clearly documented in p_24-25 with explicit description of the pause mechanism during model inference. The technical implementation is detailed in Appendix C (p_120-131) including browser manager and readiness gate.

Verified (90%) we study two agent interfaces: Computer-Use Agents (CUAs), which emit raw keyboard and mouse controls, and Generalist Multimodal Agents, which act through deterministic Semantic Action Parsing.
The two agent interfaces are clearly defined in p_3, p_11-12, and Table 1. Computer-Use Agents emit raw keyboard/mouse controls (p_13), while Generalist Multimodal Agents use semantic actions parsed through deterministic Semantic Action Parsing (p_15

Verified (90%) The agent's raw output is normalized into a shared set of executable atomic events: mouse_move, mouse_down, mouse_up, key_down, key_up, scroll, and wait. These events define the executor-level unified control space.
The unified control space is explicitly defined in p_11 and Table 1 with the seven atomic events listed: mouse_move, mouse_down, mouse_up, key_down, key_up, scroll, and wait. The normalization process is further detailed in p_136-140.
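
One way to picture the seven-event control space is as a small enumeration plus helpers that expand a higher-level input into atomic events. The sketch below is an assumption-laden illustration: the event names come from the claim above, while `AtomicEvent`, `press_key`, `click`, and the argument fields are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

class EventType(Enum):
    # The seven atomic events listed in the claim.
    MOUSE_MOVE = "mouse_move"
    MOUSE_DOWN = "mouse_down"
    MOUSE_UP = "mouse_up"
    KEY_DOWN = "key_down"
    KEY_UP = "key_up"
    SCROLL = "scroll"
    WAIT = "wait"

@dataclass(frozen=True)
class AtomicEvent:
    type: EventType
    args: Dict[str, Any] = field(default_factory=dict)

def press_key(key: str, hold_ms: int) -> List[AtomicEvent]:
    """Hypothetical helper: expand a key press into down, wait, up."""
    return [
        AtomicEvent(EventType.KEY_DOWN, {"key": key}),
        AtomicEvent(EventType.WAIT, {"ms": hold_ms}),
        AtomicEvent(EventType.KEY_UP, {"key": key}),
    ]

def click(x: int, y: int) -> List[AtomicEvent]:
    """Hypothetical helper: expand a left click into move, down, up."""
    return [
        AtomicEvent(EventType.MOUSE_MOVE, {"x": x, "y": y}),
        AtomicEvent(EventType.MOUSE_DOWN, {"button": "left"}),
        AtomicEvent(EventType.MOUSE_UP, {"button": "left"}),
    ]

jump_events = [e.type.value for e in press_key("ArrowUp", 100)]
click_events = [e.type.value for e in click(320, 240)]
```

Under this framing, both interfaces would end in the same list of `AtomicEvent` values, so the executor never needs to know which kind of agent produced them.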

Verified (90%) we distinguish two game-agent interfaces: (i) Computer-Use Agents that directly emit low-level keyboard and mouse controls (Section 2.2), and (ii) Generalist Multimodal Agents that act in a semantic space and are executed through deterministic Semantic Action Parsing (Section 2.3).
This is a restatement of the interface distinction documented in p_12, with Computer-Use Agents described in p_13-14 and Generalist Multimodal Agents in p_15. The design is clearly specified with implementation details.

Verified (90%) We enforce a one-action-per-step constraint for evaluation consistency over CUAs: each model response must contain exactly one executable action that satisfies the role-specific keyboard or mouse control specification.
The one-action-per-step constraint is explicitly stated in p_14 for CUAs and reinforced in p_15 for Generalist agents (Action Atomicity). This design choice is clearly documented as an evaluation consistency measure.

Verified (85%) Actions that fall outside the game's permitted control interface (e.g., OS-level APIs) are rejected, ensuring that CUA scores reflect in-game capabilities under a fixed action budget.
The rejection of out-of-scope actions is documented in p_14 and detailed in p_141 (legality checker). The paper describes how invalid actions are logged and ignored at execution time.
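
The one-action-per-step constraint and the rejection of out-of-scope actions can be combined into a single validity check. The following is a minimal sketch under stated assumptions: the action dictionaries, the `KEYBOARD_ONLY` set, and `validate_response` are hypothetical names, and the paper's legality checker may work differently.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical control space for a keyboard-only game.
KEYBOARD_ONLY: Set[str] = {"key_down", "key_up", "wait"}

def validate_response(actions: List[Dict], allowed: Set[str]) -> Tuple[bool, str]:
    """Accept a model response only if it holds exactly one allowed action."""
    if len(actions) != 1:
        return False, f"expected exactly one action, got {len(actions)}"
    name = actions[0].get("name")
    if name not in allowed:
        return False, f"action {name!r} is outside the allowed control space"
    return True, "ok"

valid = validate_response([{"name": "key_down", "key": "ArrowUp"}], KEYBOARD_ONLY)
mouse_in_keyboard_game = validate_response(
    [{"name": "mouse_down", "button": "left"}], KEYBOARD_ONLY
)
macro = validate_response(
    [{"name": "key_down", "key": "a"}, {"name": "key_up", "key": "a"}],
    KEYBOARD_ONLY,
)
```

In this sketch an invalid response is reported rather than executed, which mirrors the described behavior of logging and ignoring invalid actions at execution time.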

Verified (85%) we introduce Semantic Action Parsing: for each game and role, a deterministic parser maps every semantic action to a fixed low-level interaction command under the same unified runtime contract.
Semantic Action Parsing is introduced in p_15 with description of the deterministic parser mapping semantic actions to low-level commands. Further implementation details are in p_142 describing the resolution process.
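
A deterministic parser of this kind can be thought of as a lookup table keyed by game and role. The sketch below assumes a hypothetical runner game with two semantic actions; the table contents, key names, and `parse_semantic_action` are illustrative and not drawn from the paper.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, object]  # (atomic event name, argument)

# Hypothetical per-game, per-role table of semantic actions.
SEMANTIC_ACTIONS: Dict[Tuple[str, str], Dict[str, List[Command]]] = {
    ("runner_game", "player"): {
        "jump": [("key_down", "Space"), ("wait", 100), ("key_up", "Space")],
        "duck": [("key_down", "ArrowDown"), ("wait", 200), ("key_up", "ArrowDown")],
    },
}

def parse_semantic_action(game: str, role: str, action: str) -> List[Command]:
    """Map a semantic action to its fixed low-level command sequence."""
    table = SEMANTIC_ACTIONS[(game, role)]
    if action not in table:
        # Plausible but unregistered actions are rejected, not guessed at.
        raise ValueError(f"unregistered semantic action: {action!r}")
    return list(table[action])

first = parse_semantic_action("runner_game", "player", "jump")
second = parse_semantic_action("runner_game", "player", "jump")
```

Since the mapping is a pure table lookup, the same semantic action always resolves to the same command sequence, and an unregistered name fails loudly instead of being interpreted.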

Verified (90%) We further enforce Action Atomicity at the model-step level: each model response must specify one interaction command per step. What is disallowed is any multi-command macro that bundles several semantically distinct decisions into one step.
Action Atomicity is explicitly enforced in p_15, stating 'each model response must specify one interaction command per step' and explicitly disallowing 'multi-command macro that bundles several semantically distinct decisions.'

Verified (85%) we wrap each model in a shared agent harness that standardizes these components across all models.
The shared agent harness is described in p_16, standardizing prompts, memory, and tool interfaces across models. Appendix D is referenced for implementation details.

Verified (90%) we define a fixed prompt template with four components: #Game Rules, #Role and Controls, #Task Instruction, and #Output Format. The template structure stays constant across all experiments; only the game-specific rules, role description, and task objective are swapped per configuration, keeping cross-model comparisons controlled.
The fixed prompt template with four components is specified in p_17, with the structure explicitly listed in p_144-146. The paper states the template structure stays constant with only game-specific content swapped.
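
To make the fixed-structure idea concrete, the sketch below builds a prompt from a constant four-section skeleton. The four headers are the ones named in the claim; the placeholder names, the `build_prompt` helper, and the sample game text are assumptions for illustration only.

```python
# The four section headers come from the claim; everything else is hypothetical.
TEMPLATE = (
    "#Game Rules\n{game_rules}\n\n"
    "#Role and Controls\n{role_and_controls}\n\n"
    "#Task Instruction\n{task_instruction}\n\n"
    "#Output Format\n{output_format}\n"
)

def build_prompt(game_rules: str, role_and_controls: str,
                 task_instruction: str, output_format: str) -> str:
    """Fill the constant skeleton with per-configuration content."""
    return TEMPLATE.format(
        game_rules=game_rules,
        role_and_controls=role_and_controls,
        task_instruction=task_instruction,
        output_format=output_format,
    )

prompt = build_prompt(
    game_rules="Avoid obstacles; the run ends on collision.",
    role_and_controls="You are the runner. Available action: jump.",
    task_instruction="Reach a score of 100.",
    output_format="Return exactly one action per response.",
)
```

Only the four arguments change between configurations, so any difference in model behavior cannot be attributed to prompt layout.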

Verified (90%) The agent maintains a rolling memory module that stores the most recent rounds of interaction. Each round records the sequence user_prompt → screenshot → reasoning → action, and recent rounds are prepended as an Action History block before the current observation.
The rolling memory module is described in p_18 with the sequence 'user_prompt → screenshot → reasoning → action' and Action History block. Further details in p_135 describe the memory store implementation.
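
A rolling memory of this shape maps naturally onto a bounded queue. The sketch below is a minimal version under stated assumptions: `Round`, `RollingMemory`, the history line format, and the screenshot identifiers are hypothetical, and the paper's memory store may record or render rounds differently.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Round:
    user_prompt: str
    screenshot: str  # hypothetical: a path or identifier for the frame
    reasoning: str
    action: str

class RollingMemory:
    def __init__(self, max_rounds: int) -> None:
        # deque(maxlen=...) silently drops the oldest round when full.
        self._rounds = deque(maxlen=max_rounds)

    def add(self, round_: Round) -> None:
        self._rounds.append(round_)

    def action_history(self) -> str:
        lines = ["Action History:"]
        for i, r in enumerate(self._rounds, start=1):
            lines.append(f"{i}. reasoning: {r.reasoning} | action: {r.action}")
        return "\n".join(lines)

    def build_context(self, current_observation: str) -> str:
        """Prepend the history block before the current observation."""
        return self.action_history() + "\n\n" + current_observation

memory = RollingMemory(max_rounds=2)
for step in range(3):
    memory.add(Round("play", f"frame_{step}.png", f"thought {step}", f"action {step}"))
context = memory.build_context("Current screenshot: frame_3.png")
```

With `max_rounds=2`, the third `add` evicts the first round, so the context carries only the two most recent rounds.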

Verified (85%) We register the game's semantic actions and computer-use primitives as callable tools for each model, using each model provider's native function-calling (also known as tool-calling) interface (e.g., OpenAI function calling, Claude tool use, Gemini function declarations).
The tool registration using native function-calling interfaces is described in p_20, with examples given for OpenAI, Claude, and Gemini. This design choice leverages each model's native capabilities.
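
For orientation, the sketch below shows what registering actions as tools might look like using the general shape of OpenAI-style function tool definitions (a name, a description, and JSON Schema parameters). The `make_tool` helper and the two action definitions are hypothetical, and other providers use different envelope formats, so this should be read as one possible encoding rather than the paper's harness.

```python
import json
from typing import Dict, List

def make_tool(name: str, description: str,
              properties: Dict[str, dict], required: List[str]) -> dict:
    """Build one tool definition in an OpenAI-style function-calling shape."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

# Hypothetical registrations: one semantic action and one computer-use primitive.
tools = [
    make_tool("jump", "Make the character jump once.", {}, []),
    make_tool(
        "key_down",
        "Press and hold a keyboard key.",
        {"key": {"type": "string", "description": "Key name, e.g. ArrowUp"}},
        ["key"],
    ),
]

registered_names = [t["function"]["name"] for t in tools]
payload = json.dumps(tools)  # what would be attached to a model request
```

A harness could reuse `registered_names` to detect tool calls that look plausible but were never registered for the game.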

Verified (90%) the sandbox can pause game execution during model inference, so every agent faces identical game dynamics regardless of response latency. Scores then reflect what the agent decides, not how fast it responds.
The pause mechanism is clearly described in p_24 as the central design goal of the sandbox, ensuring 'every agent faces identical game dynamics regardless of response latency.'
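
The effect of pausing can be shown with a toy clock that only advances game time while unpaused. This is a simulation under stated assumptions, not the browser sandbox itself: `SimulatedGame`, `run_step`, and the millisecond values are hypothetical. Setting `pause_during_inference=False` approximates the unpaused real-time setting in which latency becomes part of the task.

```python
class SimulatedGame:
    """Toy game clock: game time advances only while the game is unpaused."""

    def __init__(self) -> None:
        self.game_time_ms = 0
        self.paused = False

    def tick(self, wall_ms: int) -> None:
        if not self.paused:
            self.game_time_ms += wall_ms

def run_step(game: SimulatedGame, inference_ms: int, action_ms: int,
             pause_during_inference: bool) -> int:
    """Simulate one agent step and return the game time afterwards."""
    game.paused = pause_during_inference
    game.tick(inference_ms)  # wall-clock time spent waiting on the model
    game.paused = False
    game.tick(action_ms)     # the chosen action then plays out in game time
    return game.game_time_ms

# Paused sandbox: a fast and a slow model see the same game dynamics.
fast_paused = run_step(SimulatedGame(), 200, 300, pause_during_inference=True)
slow_paused = run_step(SimulatedGame(), 5000, 300, pause_during_inference=True)

# Unpaused variant: the slow model loses game time while it thinks.
fast_rt = run_step(SimulatedGame(), 200, 300, pause_during_inference=False)
slow_rt = run_step(SimulatedGame(), 5000, 300, pause_during_inference=False)
```

In the paused runs both agents end at the same game time, while in the unpaused runs the slower agent ends later in game time, which is the latency coupling the pause is meant to remove.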

Verified (85%) the sandbox also supports configurable game speed and deterministic seed settings for evaluation reproducibility.
Configurable game speed and deterministic seed settings are mentioned in p_25 for evaluation reproducibility. The implementation includes 'deterministic-randomness script by overriding JavaScript headers' (p_120).
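
The reproducibility benefit of a fixed seed can be illustrated with a seeded pseudo-random generator. The sketch below is a Python analogue only: the analysis describes a browser-side script, whereas `spawn_obstacles` and its parameters are hypothetical stand-ins for any game logic that draws random numbers.

```python
import random
from typing import List

def spawn_obstacles(seed: int, count: int, lane_count: int = 3) -> List[int]:
    """Draw a sequence of obstacle lanes from a generator seeded per episode."""
    rng = random.Random(seed)  # private generator, isolated from global state
    return [rng.randrange(lane_count) for _ in range(count)]

run_a = spawn_obstacles(seed=1234, count=10)
run_b = spawn_obstacles(seed=1234, count=10)
```

Two episodes started with the same seed produce the same obstacle sequence, so differences in score can be attributed to the agent rather than to the environment's randomness.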

... 46 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

Code availability - no code repository or implementation details provided
Data availability - GameWorld benchmark and game environments not publicly accessible
Model hyperparameters - temperature, top_p, max_tokens, and other API parameters for proprietary models not specified
Random seeds - no seed information for reproducible evaluation runs
Hardware specifications - GPU/compute resources used for running models not detailed
Exact execution duration - only stated as 'usually 200-500 ms' without precise values
Number of evaluation episodes/runs per game
Game environment configurations and specific game titles used
Prompts and instructions given to each model
Verifier implementation details and scoring criteria

Limitations (as stated by the authors)

The benchmark necessitates designing unique instruction sets for each new environment, which tightly couples the action space to the task and constrains the model's scalability.
Automating the production and alignment of Semantic Action Parsing through MLLM-powered agent exploration is left for future work.

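This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.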

Analysis time: 2026-04-19T07:31:07+00:00 · Data source: Paper Collector