GrandCode introduces a multi-agent reinforcement learning system with Agentic GRPO for competitive programming. It achieved first place in three consecutive Codeforces rounds, becoming the first AI to consistently surpass top human competitors under live contest conditions, with pass@1 improving fr…
Core Question
Can an AI system be designed to consistently surpass top human competitors in competitive programming under live contest conditions?
Core Method
Approach: GrandCode orchestrates multiple agents (hypothesis model, main solver, summarization model, test-case generator) built on Qwen 3.5, with Agentic GRPO addressing off-policy issues in multi-turn rollouts through immediate reward updates and delayed correction. The training pipeline includes continued pre-training, supervised fine-tuning, multi-component RL, and test-time RL with difficulty-aware routing.
Key components:
- A lightweight model (π_hypothesis) based on Qwen-3.5-27B is used for hypothesis generation.
- SFT data is constructed from Claude-generated hypotheses that are verified to be correct.
- RL training uses GRPO with reward defined as the proportion of verification tests passed.
- The model is currently trained independently but will be jointly trained with the solver in the full system.
- A summarization model is trained to handle reasoning traces exceeding 100K tokens, reducing computational cost.
- The model maintains progressive summary states across chunked reasoning traces.
- Progressive training provides denser intermediate supervision compared to end-to-end approaches with sparse rewards.
- Stage 1 uses GRPO to optimize local summarization steps with rewards for information preservation and reconstruction.
- Summary-augmented data is mixed into SFT training to help the model handle both summarized and non-summarized contexts.
Section IDs: sec_13, sec_18, sec_20
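To make the reward definition concrete, the sketch below computes a pass-rate reward and the standard GRPO group-relative advantage (reward minus the group mean, divided by the group standard deviation). It is only an illustration under stated assumptions: the function names, the `eps` constant, and the four example rollouts are hypothetical, and the immediate-update and delayed-correction steps of Agentic GRPO are not reproduced here.

```python
from statistics import mean, pstdev


def pass_rate(results):
    """Reward: fraction of verification tests passed (results is a list of bools)."""
    return sum(results) / len(results) if results else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization within one group of rollouts.

    eps is an assumed stabilizer so that a group with identical rewards
    yields zero advantages instead of a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one problem, each checked against four tests.
group = [
    [True, True, True, True],
    [True, True, False, False],
    [True, False, False, False],
    [False, False, False, False],
]
rewards = [pass_rate(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Rollouts that pass more tests than the group average receive positive advantages and the rest receive negative ones, so the signal stays informative even when no rollout passes every test.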
Claim Verification
The paper provides a complete system description across multiple sections (2-10), specifying all major components: hypothesis model, main solver, summarization model, and test-case generator. The multi-agent architecture is fully described with their
The algorithm is fully specified with mathematical formulation in Section 4 (paragraphs 19-32). Equations for immediate reward updates and delayed correction are provided, along with theoretical analysis referenced in Appendix C.
While the paper reports winning three Codeforces rounds, the 'first AI system to consistently surpass' claim requires external validation and comparison to all prior systems. The paper provides no table with detailed rankings, scores, or comparison t
The paper states specific round numbers and dates, but provides no table, figure, or detailed results showing actual rankings, scores, or comparison to other participants. A finding this significant should have quantitative evidence such as a leaderb
The design choice is clearly stated with rationale (accessible SFT pipeline and multimodal capabilities). The paper consistently uses Qwen 3.5 throughout the system description.
The paper provides specific details about which models are used where: paragraph 36 mentions Claude, GPT, DeepSeek V3, and Kimi 2.5 for test case generation; paragraph 53 mentions Gemini 3.1 Pro and Claude 4.6 for data generation; paragraph 56 lists
The design choice is stated but not justified. No information is provided about how the classifier is trained, what data it uses, its accuracy, or why five levels are appropriate. No ablation or analysis is provided to justify this choice.
Same as claim 2 - the algorithm is fully specified with mathematical formulation.
The algorithm design is clearly explained with the core idea explicitly stated. The contrast with standard GRPO is provided.
This is a subjective assessment of difficulty. While stated as a challenge, there's no objective way to verify this claim from the paper alone.
The limitation is clearly explained with logical reasoning about why large-size test cases are problematic for verification.
The design choice is clearly stated and both strategies are described in detail in subsequent paragraphs.
The design choice is clearly stated with specific models mentioned.
The design choice is clearly stated with reference to prior work (CodeContests+).
The finding is stated with specific numbers (42 to 48) directly in the text. While Table 1 is referenced, the key numbers are provided in the text.
The finding is stated with specific numbers (remaining 2 failures resolved to reach 50/50) directly in the text.
The design choice is clearly stated with specific model (Qwen-3.5-27B) mentioned.
The design choice is clearly described with the SFT data construction process.
The design choice is clearly stated with the reward formula provided.
The limitation is clearly acknowledged with reasoning about why independent training is imperfect.
... 62 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing reproduction details
- No code available - all implementation details must be reverse-engineered from paper
- No training/evaluation data available - dataset construction details are incomplete (exact problem sources, number of problems, selection criteria)
- Missing SFT hyperparameters: learning rate, batch size, number of epochs, optimizer settings
- Missing GRPO hyperparameters: group size G, learning rate, number of training steps, KL penalty coefficient
- Missing LoRA parameters: rank, alpha, dropout rate for test-time RL
- Missing random seeds for reproducibility
- Missing hardware specifications: GPU types, memory requirements, training duration
- Missing exact reward function formulations - score equations are referenced but not fully specified
- Missing chunk size and partitioning strategy for reasoning traces
- Missing exact prompts/templates used for Claude hypothesis generation
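The chunk size matters because the summarization model is described as carrying a progressive summary state across chunks of a long reasoning trace. A minimal sketch of that control flow follows; the chunk size, the `summarize` callback signature, and the placeholder `keep_tail` summarizer are all assumptions made for illustration, since the paper does not specify them.

```python
def split_into_chunks(tokens, chunk_size):
    """Partition a trace into consecutive chunks; the last one may be shorter."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def progressive_summary(tokens, chunk_size, summarize):
    """Fold a summary state over the chunks: state_k = summarize(state_{k-1}, chunk_k)."""
    state = []
    for chunk in split_into_chunks(tokens, chunk_size):
        state = summarize(state, chunk)
    return state


def keep_tail(state, chunk, budget=4):
    """Placeholder summarizer (hypothetical): keep only the last `budget` tokens."""
    return (state + chunk)[-budget:]


trace = list(range(10))  # stand-in for a tokenized reasoning trace
final_state = progressive_summary(trace, chunk_size=4, summarize=keep_tail)
```

In a real system `summarize` would be a model call, and each intermediate state is a point where a local reward could be attached, which is one way to read the claim about denser intermediate supervision.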
Limitations (as stated by the authors)
- It is difficult to generate genuinely adversarial test cases that expose subtle logical errors rather than merely checking superficial correctness
- Large test cases are hard to use for verification: in many settings the only trusted solver available is a brute-force implementation, which times out on large inputs, so there are no gold outputs to compare against for those inputs
- At the current stage, π_hypothesis is trained independently for simplicity. However, this is inherently imperfect, since generating a correct hypothesis does not necessarily imply that it is helpful to generate the correct solution
- At this stage, the data can be noisy because it is partly generated, and some synthesized reasoning traces or answers may be incorrect
- It is worth noting that this strategy is effective for relatively easy tasks, where we can often recover a solution close to the gold one. For harder tasks, however, the generated solution frequently fails to match the gold solution, making this simple filtering procedure insufficient
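The first two limitations can be illustrated with a small differential-testing sketch: a candidate solution is compared against a trusted brute-force reference on small random inputs, which is exactly the regime where the reference is still fast enough to produce gold outputs. The maximum-subarray task, the function names, and the input ranges are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def brute_force_max_subarray(a):
    """Trusted but slow O(n^2) reference: maximum sum over non-empty subarrays."""
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best


def kadane(a):
    """Correct O(n) candidate."""
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def buggy_candidate(a):
    """Subtly wrong: permits the empty subarray, so all-negative inputs give 0."""
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best


def find_counterexample(candidate, reference, trials=500, max_n=6, seed=0):
    """Return a small input where candidate and reference disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(rng.randint(1, max_n))]
        if candidate(a) != reference(a):
            return a
    return None


counterexample = find_counterexample(buggy_candidate, brute_force_max_subarray)
```

Random small inputs catch this bug only because all-negative arrays are common at this size; an error that appears solely on large or carefully structured inputs would slip through, which is the gap the authors describe.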
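This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-10T01:18:11+00:00 · Data source: Paper Collector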