GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning - In-Depth AI Paper Analysis

TL;DR
GrandCode introduces a multi-agent reinforcement learning system with Agentic GRPO for competitive programming. It achieved first place in three consecutive Codeforces rounds, becoming the first AI to consistently surpass top human competitors under live contest conditions, with pass@1 improving fr…

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

74%

Core Question

Can an AI system be designed to consistently surpass top human competitors in competitive programming under live contest conditions?

Core Method

GrandCode orchestrates multiple agents (hypothesis model, main solver, summarization model, test-case generator) built on Qwen 3.5, with Agentic GRPO addressing off-policy issues in multi-turn rollouts through immediate reward updates and delayed correction. The training pipeline includes continued pre-training, supervised fine-tuning, multi-component RL, and test-time RL with difficulty-aware routing.

Key components

A lightweight model (π_hypothesis) based on Qwen-3.5-27B is used for hypothesis generation.
SFT data is constructed from Claude-generated hypotheses that are verified to be correct.
RL training uses GRPO with reward defined as the proportion of verification tests passed.
The model is currently trained independently but will be jointly trained with the solver in the full system.
A summarization model is trained to handle reasoning traces exceeding 100K tokens, reducing computational cost.
The model maintains progressive summary states across chunked reasoning traces.
Progressive training provides denser intermediate supervision compared to end-to-end approaches with sparse rewards.
Stage 1 uses GRPO to optimize local summarization steps with rewards for information preservation and reconstruction.
Summary-augmented data is mixed into SFT training to help the model handle both summarized and non-summarized contexts.

Source sections: sec_13, sec_18, sec_20
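The idea of maintaining progressive summary states across chunked reasoning traces can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the chunk size (the analysis notes the paper does not specify it), the function names, and `toy_summarize_step`, which is a trivial stand-in for the trained summarization model.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a long trace into consecutive fixed-size chunks (size is an assumption)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def progressive_summarize(
    tokens: List[str],
    chunk_size: int,
    summarize_step: Callable[[str, List[str]], str],
) -> List[str]:
    """Fold chunks into a running summary, keeping every intermediate state.

    Each state depends only on the previous summary and the current chunk,
    so the full trace never has to fit in one context window.
    """
    state = ""
    states = []
    for chunk in chunk_tokens(tokens, chunk_size):
        state = summarize_step(state, chunk)
        states.append(state)
    return states


def toy_summarize_step(prev_summary: str, chunk: List[str]) -> str:
    """Hypothetical stand-in for the summarization model: keep each chunk's first token."""
    return (prev_summary + " " + chunk[0]).strip()


states = progressive_summarize("a b c d e".split(), 2, toy_summarize_step)
```

Because every intermediate state is retained, each local step could in principle be scored on its own, which is one way to read the claim that progressive training gives denser supervision than a single end-of-trace reward.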

Claim Verification

Verified (85%) we introduce GrandCode, a multi-agent reinforcement learning system designed for competitive programming
The paper provides a complete system description across multiple sections (2-10), specifying all major components: hypothesis model, main solver, summarization model, and test-case generator. The multi-agent architecture is fully described with their

Verified (80%) we introduce Agentic GRPO, a variant of Group Relative Policy Optimization that combines immediate reward updates with delayed correction, enabling more effective credit assignment under long, multi-stage rollouts and asynchronous training
The algorithm is fully specified with mathematical formulation in Section 4 (paragraphs 19-32). Equations for immediate reward updates and delayed correction are provided, along with theoretical analysis referenced in Appendix C.

Insufficient evidence (50%) GrandCode is the first AI system to consistently surpass the best human competitors in competitive programming under live contest conditions
While the paper reports winning three Codeforces rounds, the 'first AI system to consistently surpass' claim requires external validation and comparison to all prior systems. The paper provides no table with detailed rankings, scores, or comparison t

Insufficient evidence (50%) in the three most recent Codeforces rounds under standard live contest conditions: Round 1087 on March 21, Round 1088 on March 28, and Round 1089 on March 29, 2026, GrandCode placed first in all three contests, outperforming every human participant, including multiple top-ranked legendary grandmasters
The paper states specific round numbers and dates, but provides no table, figure, or detailed results showing actual rankings, scores, or comparison to other participants. A finding this significant should have quantitative evidence such as a leaderb

Verified (80%) It builds on Qwen 3.5 as the foundation model, chosen for its accessible SFT pipeline and multimodal capabilities
The design choice is clearly stated with rationale (accessible SFT pipeline and multimodal capabilities). The paper consistently uses Qwen 3.5 throughout the system description.

Verified (85%) We also employ models such as Kimi 2.5, GLM, and other closed-source LLMs for data generation in several modules
The paper provides specific details about which models are used where: paragraph 36 mentions Claude, GPT, DeepSeek V3, and Kimi 2.5 for test case generation; paragraph 53 mentions Gemini 3.1 Pro and Claude 4.6 for data generation; paragraph 56 lists

Insufficient evidence (40%) We fine-tune a lightweight classifier to assign each task to one of five difficulty levels, where Level 1 denotes the easiest problems and Level 5 the hardest
The design choice is stated but not justified. No information is provided about how the classifier is trained, what data it uses, its accuracy, or why five levels are appropriate. No ablation or analysis is provided to justify this choice.

Verified (80%) we propose Agentic GRPO with Immediate Reward and Delayed Correction, which is designed to handle multi-stage rollouts in agentic settings under the asynchronous training framework
Same as claim 2 - the algorithm is fully specified with mathematical formulation.

Verified (85%) In the standard GRPO, we update the whole trajectory with the final reward r_N and ignore the intermediate rewards r_1, r_2, . . . , r_{N-1}. The core idea behind Agentic GRPO is that to have the trainer update the policy as soon as possible once an intermediate reward r_t is available instead of waiting for the final reward r_N
The algorithm design is clearly explained with the core idea explicitly stated. The contrast with standard GRPO is provided.
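The contrast between the two update schedules can be sketched as follows. This is only an illustration of the timing difference described in the claim, not the paper's algorithm: the group-normalized advantage is the standard GRPO form, while the function names and the `stage_rewards[i][t]` layout are assumptions, and the delayed-correction term is deliberately left out because its equations are not reproduced in this analysis.

```python
import statistics
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard GRPO-style advantage: reward normalized within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def final_reward_schedule(stage_rewards: List[List[float]]) -> List[Tuple[int, List[float]]]:
    """Standard GRPO: a single update, issued only after the last stage,
    using the final reward r_N of each trajectory in the group."""
    last_stage = len(stage_rewards[0]) - 1
    final = [traj[last_stage] for traj in stage_rewards]
    return [(last_stage, group_advantages(final))]


def immediate_reward_schedule(stage_rewards: List[List[float]]) -> List[Tuple[int, List[float]]]:
    """Immediate-reward idea: one update per stage t, issued as soon as r_t
    is available for the group, instead of waiting for r_N."""
    n_stages = len(stage_rewards[0])
    return [
        (t, group_advantages([traj[t] for traj in stage_rewards]))
        for t in range(n_stages)
    ]


# Hypothetical group of two trajectories with two stages each.
rewards = [[0.0, 0.0], [1.0, 1.0]]
standard_updates = final_reward_schedule(rewards)
immediate_updates = immediate_reward_schedule(rewards)
```

With N stages, the first schedule produces one update and the second produces N, which is where the earlier credit assignment comes from. Updating before the trajectory finishes is also what makes some later correction necessary.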

Unverifiable (60%) it is difficult to generate genuinely adversarial test cases that expose subtle logical errors rather than merely checking superficial correctness
This is a subjective assessment of difficulty. While stated as a challenge, there's no objective way to verify this claim from the paper alone.

Verified (75%) large-size test cases are hard to use for verification, because in many settings the only trusted solver available to us is a brute-force implementation, and that solver times out on large inputs, which makes we don't have gold outputs to compare with for large-size inputs
The limitation is clearly explained with logical reasoning about why large-size test cases are problematic for verification.
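The role of a brute-force solver as the only trusted oracle can be illustrated with a small differential test. The problem (maximum non-empty subarray sum), the deliberately buggy candidate, and all function names are hypothetical examples chosen for this sketch, not taken from the paper.

```python
import random
from typing import Callable, List, Optional


def brute_max_subarray(a: List[int]) -> int:
    """Trusted O(n^2) oracle: try every non-empty subarray."""
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best


def buggy_max_subarray(a: List[int]) -> int:
    """Fast O(n) candidate with a subtle bug: it allows the empty subarray,
    so it returns 0 when every element is negative."""
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def find_counterexample(
    candidate: Callable[[List[int]], int],
    oracle: Callable[[List[int]], int],
    trials: int = 1000,
    max_len: int = 6,
    seed: int = 0,
) -> Optional[List[int]]:
    """Compare candidate and oracle on small random inputs; return the first mismatch."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, max_len))]
        if candidate(a) != oracle(a):
            return a
    return None


counterexample = find_counterexample(buggy_max_subarray, brute_max_subarray)
```

The oracle is only usable because `max_len` is small. On inputs near a contest's upper limits the quadratic loop would time out, which is exactly the situation the claim describes: no gold output exists to compare against.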

Verified (80%) We adopt two strategies for adversarial test generation: difference-driven test generation and solution attack
The design choice is clearly stated and both strategies are described in detail in subsequent paragraphs.

Verified (80%) we iteratively generate candidate test cases by prompting Claude, GPT, DeepSeek V3, and Kimi 2.5 to produce inputs that are likely to reveal corner cases
The design choice is clearly stated with specific models mentioned.

Verified (80%) we further employ a generator-validator framework inspired by CodeContests+, in which an additional LLM validator is used to filter or refine the generated tests
The design choice is clearly stated with reference to prior work (CodeContests+).
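The filtering half of a generator-validator loop can be sketched as below. Note the substitution: the paper uses an LLM validator, whereas this sketch swaps in a rule-based checker so that it runs on its own. The input format (a line with n, then a line with n integers), the bounds, and the function names are all assumptions.

```python
from typing import Callable, Iterable, List


def validate_input(text: str, max_n: int = 10, max_abs: int = 100) -> bool:
    """Rule-based stand-in for the validator: accept only inputs that match
    the hypothetical format and constraints."""
    lines = text.strip().split("\n")
    if len(lines) != 2:
        return False
    try:
        n = int(lines[0])
        values = [int(tok) for tok in lines[1].split()]
    except ValueError:
        return False
    return 1 <= n <= max_n and len(values) == n and all(abs(v) <= max_abs for v in values)


def filter_tests(candidates: Iterable[str], validator: Callable[[str], bool]) -> List[str]:
    """Keep candidates the validator accepts, dropping exact duplicates."""
    seen = set()
    kept = []
    for cand in candidates:
        norm = cand.strip()
        if norm not in seen and validator(norm):
            seen.add(norm)
            kept.append(norm)
    return kept


# Hypothetical generator outputs: one valid, one with a wrong count,
# one non-numeric, one duplicate, one violating n >= 1.
generated = ["3\n1 2 3\n", "3\n1 2\n", "2\n5 x\n", "3\n1 2 3", "0\n\n"]
valid_tests = filter_tests(generated, validate_input)
```

Passing the validator as a callable keeps the filter independent of how validation is done, so an LLM-backed validator could be dropped in without changing `filter_tests`.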

Verified (75%) We evaluate test case generation on 50 real Codeforces problems by submitting solutions to the real Codeforces website and using the real system as the final criterion. The pass count increases from 42 to 48 after applying difference-driven test case generation and solution attack
The finding is stated with specific numbers (42 to 48) directly in the text. While Table 1 is referenced, the key numbers are provided in the text.

Verified (75%) For the remaining two failures, we further incorporate submission feedback and continue generating additional test cases online, which raises the pass count to all of 50 tests
The finding is stated with specific numbers (remaining 2 failures resolved to reach 50/50) directly in the text.

Verified (80%) We use a lightweight model for this component rather than allocating a large model to every instance. We denote this model by π_hypothesis and use Qwen-3.5-27B as the base model for π_hypothesis
The design choice is clearly stated with specific model (Qwen-3.5-27B) mentioned.

Verified (80%) To construct SFT data, we use Claude to generate multiple candidate hypotheses given only the problem statement, and retain those that are verified to be correct
The design choice is clearly described with the SFT data construction process.

Verified (85%) After SFT, we further optimize π_hypothesis with reinforcement learning using GRPO. At the current stage, the reward for a generated hypothesis h is the proportion of verification tests it passes
The design choice is clearly stated with the reward formula provided.
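A reward defined as the proportion of verification tests passed can be sketched in a few lines. The sketch assumes a hypothesis can be expressed as a callable checked against (input, expected) pairs. The analysis does not describe how verification is actually run, so this representation, the function name, and the toy hypotheses are illustrative only.

```python
from typing import Callable, List, Tuple


def hypothesis_reward(
    hypothesis: Callable[[int], int],
    tests: List[Tuple[int, int]],
) -> float:
    """Reward = (number of verification tests passed) / (number of tests)."""
    if not tests:
        raise ValueError("at least one verification test is required")
    passed = sum(1 for x, expected in tests if hypothesis(x) == expected)
    return passed / len(tests)


# Toy task: the sum of the first n odd numbers.
verification_tests = [(1, 1), (2, 4), (3, 9), (4, 16)]

correct_reward = hypothesis_reward(lambda n: n * n, verification_tests)
wrong_reward = hypothesis_reward(lambda n: n * (n + 1) // 2, verification_tests)
```

Here the correct hypothesis scores 1.0, while the wrong one agrees with the tests only at n = 1 and scores 0.25. The same function with a threshold of 1.0 would serve as the "retain those that are verified to be correct" filter used to build the SFT data.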

Verified (80%) At the current stage, π_hypothesis is trained independently for simplicity. However, this is inherently imperfect, since generating a correct hypothesis does not necessarily imply that it is helpful to generate the correct solution
The limitation is clearly acknowledged with reasoning about why independent training is imperfect.

... 62 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code available - all implementation details must be reverse-engineered from paper
No training/evaluation data available - dataset construction details are incomplete (exact problem sources, number of problems, selection criteria)
Missing SFT hyperparameters: learning rate, batch size, number of epochs, optimizer settings
Missing GRPO hyperparameters: group size G, learning rate, number of training steps, KL penalty coefficient
Missing LoRA parameters: rank, alpha, dropout rate for test-time RL
Missing random seeds for reproducibility
Missing hardware specifications: GPU types, memory requirements, training duration
Missing exact reward function formulations - score equations are referenced but not fully specified
Missing chunk size and partitioning strategy for reasoning traces
Missing exact prompts/templates used for Claude hypothesis generation

Limitations (as stated by the authors)

it is difficult to generate genuinely adversarial test cases that expose subtle logical errors rather than merely checking superficial correctness
large-size test cases are hard to use for verification, because in many settings the only trusted solver available to us is a brute-force implementation, and that solver times out on large inputs, which makes we don't have gold outputs to compare with for large-size inputs
At the current stage, π_hypothesis is trained independently for simplicity. However, this is inherently imperfect, since generating a correct hypothesis does not necessarily imply that it is helpful to generate the correct solution
At this stage, the data can be noisy because it is partly generated, and some synthesized reasoning traces or answers may be incorrect
It is worth noting that this strategy is effective for relatively easy tasks, where we can often recover a solution close to the gold one. For harder tasks, however, the generated solution frequently fails to match the gold solution, making this simple filtering procedure insufficient

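This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic peer review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.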

Analysis time: 2026-04-10T01:18:11+00:00 · Data source: Paper Collector