OpenGame introduces the first open-source agentic framework for end-to-end web game creation, combining GameCoder-27B with Template and Debug Skills. On OpenGame-Bench (150 tasks), it achieves BH=63.9, VU=57.0, IA=54.1, outperforming all direct LLM baselines, with peak performance reaching BH=72.
Core Problem
How can LLM-based agents be enabled to reliably generate playable, multi-file web games from natural language specifications, overcoming the logical incoherence, engine-specific knowledge gaps, and cross-file inconsistencies that plague current frontier models?
Core Method
Approach: The authors develop OpenGame, combining GameCoder-27B (trained via continual pre-training, supervised fine-tuning, and execution-grounded reinforcement learning on game repositories) with a six-phase agentic workflow. The framework introduces Game Skill, comprising Template Skill for project scaffolding and Debug Skill for systematic error repair, and is evaluated dynamically on OpenGame-Bench with 150 browser game tasks.
Key components:
- OpenGame combines a domain-specialized code model with a structured multimodal coding agent.
- The methodology has three pillars: base model training, autonomous game-generation workflow, and agent evolution through reusable skills.
- GameCoder-27B is built on the Qwen3.5-27B backbone to address multi-file synthesis challenges.
- The training pipeline consists of three stages: Continual Pre-Training, Supervised Fine-Tuning, and Reinforcement Learning.
- Sequential ablation of training stages: CPT, SFT, RL
- Base model: Qwen3.5-27B
- OpenGame agentic framework kept fixed during ablation
- Claude Sonnet 4.6 backend used throughout to isolate system design
- Core routing and context-management mechanisms disabled one at a time
- Focus on Autonomous Agent Workflow components
Source sections: sec_3, sec_4, sec_14, sec_16
Claim Verification
The paper states these three failure modes as observations but provides no systematic quantitative analysis of frontier model failures. No data is presented on failure frequencies, no comparative analysis across models, and no controlled experiments
This is a novelty claim ('first open-source agentic framework') that requires external verification against the broader research landscape. The paper does not provide systematic comparison with existing frameworks to substantiate the 'first' claim. T
Game Skill is a core contribution that is fully specified (Template Skill + Debug Skill) and demonstrated through experiments. The ablation studies in Table 4 show the contribution of these components to overall performance.
Template Skill is fully specified and its contribution is demonstrated through ablation. Table 4 shows the improvement from M0 to the full evolved library L, with specific performance metrics.
Debug Skill is fully specified with the living debugging protocol P, and its contribution is demonstrated through ablation studies showing performance improvements.
GameCoder-27B is described as a domain-specialized model with its training pipeline specified. The model is used in experiments and its contribution is shown through ablation.
The three-stage training pipeline (CPT, SFT, RL) is fully specified and its contribution is demonstrated through sequential ablation in paragraph 39.
OpenGame-Bench is introduced as a complete evaluation pipeline with 150 tasks, and is used throughout the experiments to evaluate all systems.
The three evaluation dimensions (Build Health, Visual Usability, Intent Alignment) are fully specified with their scoring methodology, and are used consistently throughout the experiments.
Same as claim 6: GameCoder-27B training is described and demonstrated through experiments and ablation.
Same as claim 8: OpenGame-Bench is introduced and used for evaluation.
The choice of Qwen3.5-27B as the backbone is clearly stated. This is a straightforward design choice that doesn't require additional justification.
Same as claim 7: the three-stage pipeline is specified and ablated.
While the paper mentions assembling a 'large-scale pre-training corpus,' no details are provided about corpus size, number of repositories, data quality filtering, or composition statistics. The claim lacks concrete evidence about what constitutes 'l
The use of specific models (gpt-codex5.1 for prompts, minimax2.5 for solutions) for synthetic data generation is clearly stated as part of the SFT pipeline.
The RL stage with execution-based feedback at component level is described, including the reward computation from execution success and test pass rate.
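As an illustration of how a component-level reward might combine execution success with test pass rate, here is a minimal sketch. The `ExecutionResult` schema, the `component_reward` function, the zero reward for non-running components, and the `exec_weight` mixing coefficient are all assumptions made for this example; the source does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Outcome of running one generated component in a sandbox (hypothetical schema)."""
    executed_ok: bool   # component built and ran without crashing
    tests_passed: int   # number of unit tests that passed
    tests_total: int    # number of unit tests that were run


def component_reward(result: ExecutionResult, exec_weight: float = 0.5) -> float:
    """Blend execution success and test pass rate into a reward in [0, 1].

    exec_weight is an assumed mixing coefficient, not a value from the paper.
    """
    if not 0.0 <= exec_weight <= 1.0:
        raise ValueError("exec_weight must be in [0, 1]")
    if not result.executed_ok:
        # Assumption: a component that fails to run earns no reward at all.
        return 0.0
    pass_rate = result.tests_passed / result.tests_total if result.tests_total else 0.0
    return exec_weight + (1.0 - exec_weight) * pass_rate
```

Under these assumptions, a component that runs but passes only some of its tests still earns partial credit, which gives the policy a denser signal than a pure pass/fail reward would.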
The six operational phases are fully specified and described in detail in the methodology section.
The Physics-First Classification rule is described with specific examples of how physical constraints map to archetypes.
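A rule of this kind can be pictured as a small decision function that looks at physical constraints before anything else. The sketch below is purely illustrative: the `PhysicsProfile` fields and the archetype labels (`grid_logic`, `platformer`, `top_down`) are hypothetical names chosen for the example, not the categories or examples used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsProfile:
    """Physical constraints extracted from a game request (hypothetical fields)."""
    has_gravity: bool         # objects fall unless supported
    continuous_motion: bool   # movement is continuous rather than cell-by-cell


def classify_archetype(profile: PhysicsProfile) -> str:
    """Pick a game archetype from physics alone; labels are illustrative."""
    if not profile.continuous_motion:
        return "grid_logic"    # e.g. puzzle or board-style games
    if profile.has_gravity:
        return "platformer"    # side-view games with jumping and falling
    return "top_down"          # free movement without gravity


# A request describing jumping between ledges implies gravity and continuous motion.
archetype = classify_archetype(PhysicsProfile(has_gravity=True, continuous_motion=True))
```

Deciding on physics first means the classification does not depend on surface theme words in the prompt, such as "space" or "dungeon".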
The Three-Layer Reading Strategy is described but not justified through ablation or analysis. No evidence is provided that this specific strategy improves performance compared to alternatives.
The Template Method Pattern approach is clearly described with specific examples of hook methods.
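For readers unfamiliar with the Template Method Pattern, the following sketch shows the general idea: a base class fixes the order of the frame loop, and subclasses override only the hook methods. The class names and hooks (`BaseScene`, `handle_input`, `update_world`, `render`) are assumptions made for illustration and are not taken from the OpenGame templates.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseScene(ABC):
    """Skeleton scene: the frame order is fixed, the steps are overridable hooks."""

    def run_frame(self, dt: float) -> List[str]:
        # Template method: subclasses must not change this ordering.
        return [self.handle_input(), self.update_world(dt), self.render()]

    def handle_input(self) -> str:
        return "input:none"            # optional hook with a default

    @abstractmethod
    def update_world(self, dt: float) -> str:
        """Required hook: game-specific logic goes here."""

    def render(self) -> str:
        return "render:default"        # optional hook with a default


class PlatformerScene(BaseScene):
    """Game-specific code fills in hooks without touching the frame loop."""

    def handle_input(self) -> str:
        return "input:jump"

    def update_world(self, dt: float) -> str:
        return f"update:gravity dt={dt}"


frame_log = PlatformerScene().run_frame(0.5)
```

The appeal for code generation is that the model only writes the hook bodies, so the cross-file control flow stays under the template's control.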
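... 38 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Code unavailable: no code implementation is provided.
- Data unavailable: neither the training data nor the evaluation dataset has been released.
- Training hyperparameters missing: key hyperparameters such as learning rate, batch size, and number of epochs are not given for the CPT, SFT, or RL stages.
- Random seeds unspecified: the paper mentions evaluating three times with different random seeds but does not give the seed values.
- Hardware environment not described: no information is given on compute resources such as GPU/TPU model, device count, or training time.
- RL training details missing: the reinforcement learning algorithm, reward function design, and training strategy are not described.
- Training data details missing: the source, scale, and format of the datasets used in the CPT and SFT stages are not described.
- VLM evaluator unspecified: the vision-language model used for VU and IA evaluation is not identified.
- Agent workflow details insufficient: the concrete implementation and configuration parameters of the autonomous game-generation workflow are missing.
- Baseline implementation details missing: the run configurations and parameter settings of the baseline models are not provided.
Limitations (as stated by the authors)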
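To make the random-seed item above concrete, the sketch below shows one way a multi-seed evaluation could be reported so that it can be rerun: record each seed with its score, then give the mean and sample standard deviation. The function name `summarize_runs` and the seeds and scores are made-up placeholders, not values from the paper.

```python
import statistics
from typing import Dict, List, Tuple


def summarize_runs(scores_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of one metric across seeds."""
    scores: List[float] = list(scores_by_seed.values())
    if len(scores) < 2:
        raise ValueError("need at least two seeded runs to estimate spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Seeds and scores below are made-up placeholders, not results from the paper.
example_runs = {0: 60.0, 1: 63.0, 2: 66.0}
mean_score, spread = summarize_runs(example_runs)
print(f"seeds={sorted(example_runs)} mean={mean_score:.1f} std={spread:.1f}")
```

Publishing the seed list alongside the aggregate is what allows a third party to check whether a reproduced score falls within the reported spread.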
- Even the full OpenGame system leaves approximately 34.9% of weighted mechanical requirements partially or fully unsatisfied.
- In these games, logical state management (for example, inventory tracking or match-three rules) is more weakly coupled to visible rendering. When logic desynchronizes, the resulting failures are often silent, triggering neither compiler warnings nor runtime crashes. The lack of explicit trace signals makes such errors substantially harder for the agent to detect and repair during automated debugging, highlighting an important direction for future work.
- This ceiling reflects the intrinsic difficulty of translating ambiguous natural-language prompts into self-consistent, playable multi-file systems spanning logic, rendering, and asset management.
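The silent desynchronization described in the limitations above can be illustrated with a toy example: game logic changes an inventory, the display layer is never updated, and nothing fails unless the two are compared explicitly. The `find_desync` helper and the `inventory` and `hud_icons` dictionaries are hypothetical; this shows the failure mode, not a mechanism proposed in the paper.

```python
from typing import Dict, Tuple


def find_desync(logical: Dict[str, int], rendered: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return items whose logical count differs from the rendered count.

    Each value is a (logical_count, rendered_count) pair.
    """
    mismatches: Dict[str, Tuple[int, int]] = {}
    for item in sorted(set(logical) | set(rendered)):
        expected, shown = logical.get(item, 0), rendered.get(item, 0)
        if expected != shown:
            mismatches[item] = (expected, shown)
    return mismatches


# Game logic grants a key, but the (hypothetical) HUD layer is never told.
inventory = {"coin": 3}
hud_icons = {"coin": 3}
inventory["key"] = 1   # no exception and no warning: the failure is silent

# Only an explicit comparison turns the mismatch into a visible signal.
report = find_desync(inventory, hud_icons)
```

The assignment on its own produces no trace at all, which is why an agent that relies on compiler or runtime errors has nothing to react to in this class of bug.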
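This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-24T01:30:27+00:00 · Data source: Paper Collector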