TL;DR
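Identifies why on-policy self-distillation (OPSD) fails in reinforcement learning with verifiable rewards (RLVR): an irreducible information gap between teacher and student drives privileged information leakage. Proposes RLSD, which decouples update direction from update magnitude. RLSD improves average accuracy over the base model by 4.69% under the 4K setting, converges faster than GRPO, and trains stably.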
Confirmed: 33 · Insufficient evidence: 5 · Unverifiable: 2 · Reproducibility: N/A · Confidence: 82%

Core Question

Why does on-policy self-distillation (OPSD) fail in reinforcement learning with verifiable rewards (RLVR), and how can self-distillation be reformulated to provide token-level supervision without leaking privileged information?

Core Method

Approach: The authors provide a theoretical analysis proving that OPSD's irreducible mutual information gap causes information leakage, then propose RLSD, which repurposes the teacher-student discrepancy as a credit assignment signal within GRPO's policy gradient framework. The method uses the environment reward for the update direction and the teacher's evidence ratio for magnitude modulation, and is evaluated on Qwen3-VL-8B-Instruct across five multimodal reasoning benchmarks.

Key components:
- OPSD-trained models systematically reference privileged information unavailable at inference time, such as invisible "reference solutions".
- The OPD teacher is defined as an external model, while the OPSD teacher is the same model conditioned on privileged information r.
- Both methods share the training objective of minimizing the per-token divergence D between the teacher and student distributions.
- Gradients are backpropagated only through the student distribution P_S, with the teacher P_T serving as a fixed target.
- The base model is Qwen3-VL-8B-Instruct, compared against GRPO, OPSD, SDPO, and GRPO+OPSD baselines.
- Training uses batch size 256, 8 rollouts per prompt, temperature 1.0, and learning rate 1×10^-6 for GRPO-based methods.
- RLSD hyperparameters: λ initialized at 0.5 and decayed to 0 over 50 steps, ε_w = 0.2, teacher synced every 10 steps.
- RLSD requires only ground-truth answers as privileged information, which is less demanding than OPSD's verified reasoning traces.

Section IDs: sec_4, sec_17
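The direction/magnitude split described above can be illustrated with a minimal sketch. It is not the paper's Eqs. 10-15, which this summary does not reproduce. The function names, the log-ratio form of the evidence ratio, and the sign-flip rule for negative advantages are all assumptions made for illustration; only the group-normalized GRPO advantage and the clip bound ε_w = 0.2 come from the text.

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages over one prompt's rollouts: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-6) for r in rewards]

def token_weights(teacher_logp, student_logp, advantage, eps_w=0.2):
    """Hypothetical per-token magnitude weights (assumed form, not the paper's).

    teacher_logp[t]: log P_T(y_t | x, r, y_<t), teacher conditioned on privileged info r
    student_logp[t]: log P_S(y_t | x, y_<t), student without r
    """
    weights = []
    for lt, ls in zip(teacher_logp, student_logp):
        gain = lt - ls  # log evidence ratio for this token
        # Assumed direction-aware rule: invert the ratio when the rollout was
        # penalized, so tokens the teacher disfavors receive more of the blame.
        signed = gain if advantage >= 0 else -gain
        ratio = math.exp(signed)
        # Clip so the teacher can only rescale the update within [1 - eps_w, 1 + eps_w].
        weights.append(min(max(ratio, 1.0 - eps_w), 1.0 + eps_w))
    return weights

def token_advantages(advantage, weights):
    """The sign comes from the environment reward; the teacher only scales magnitude."""
    return [advantage * w for w in weights]
```

Because every weight is strictly positive, the teacher can never flip the sign of an update in this sketch; the reward alone decides whether a rollout is reinforced or suppressed.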

Claim Verification

Confirmed (90%) We propose RLVR with Self-Distillation (RLSD), which instantiates this principle by repurposing the teacher from a generative target to a magnitude evaluator.
The paper provides a complete specification of the RLSD method in Section 4, including the algorithmic formulation (Eqs. 10-15), the three-step procedure (privileged information gain, direction-aware evidence reweighting, clipped credit assignment),
Confirmed (95%) We identify the root cause of OPSD's failure through controlled experiments and formal analysis, proving that distribution matching under information asymmetry induces an irreducible gap that drives privileged information leakage through the gradient structure.
The paper provides both controlled experiments (Figure 3 showing leakage frequency, validation accuracy, and KL divergence patterns) and formal analysis (Theorem 1 proving the irreducible mutual information gap, Proposition 1 characterizing gradient
Confirmed (95%) Theorem 1 (KL Decomposition). The OPSD objective and the ideal objective satisfy the identity: D_KL(P_S || P_T(·|r)) = D_KL(P_S || P_T) + I(Y_t; R | X, Y_{<t})
The theorem is formally stated with the complete mathematical identity. The proof is provided in Appendix A.1 (p_85-p_89), showing the derivation step-by-step through KL decomposition and conditional mutual information.
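A small numerical check can make the decomposition in Theorem 1 concrete. The sketch below verifies, on a toy distribution, that the expected divergence to an r-conditioned teacher splits into a divergence to the marginal teacher plus a mutual information term that does not depend on the student. It takes the divergence from the teacher to the student and averages over r, the ordering under which this standard identity holds exactly; that ordering, the toy numbers, and all names are assumptions, and the paper's exact statement should be checked against Appendix A.1.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setup: privileged info r takes two values; the next token y takes three.
prior_r = [0.4, 0.6]
teacher = [
    [0.7, 0.2, 0.1],  # P_T(y | r = 0)
    [0.1, 0.3, 0.6],  # P_T(y | r = 1)
]
student = [0.3, 0.3, 0.4]  # P_S(y): the student never sees r

# Marginal teacher: what remains once r is averaged out.
marginal = [sum(prior_r[r] * teacher[r][y] for r in range(2)) for y in range(3)]

def expected_kl(q):
    """E_r[ D_KL(P_T(.|r) || q) ], the r-averaged matching objective."""
    return sum(prior_r[r] * kl(teacher[r], q) for r in range(2))

# Mutual information I(Y; R) between the token and the privileged info.
mutual_info = sum(prior_r[r] * kl(teacher[r], marginal) for r in range(2))

lhs = expected_kl(student)
rhs = kl(marginal, student) + mutual_info

# Even a student that matches the marginal teacher exactly is stuck at I(Y; R).
best_case = expected_kl(marginal)
```

Here `lhs` and `rhs` agree up to floating-point error, and `best_case` equals `mutual_info`, which is strictly positive whenever the teacher's prediction actually depends on r.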
Confirmed (95%) Proposition 1 (Per-Sample Gradient Decomposition). For any specific realization of r, the per-sample gradient admits the decomposition: g(θ; r) = g*(θ) + δ(θ; r)
The proposition is formally stated with properties (i) and (ii). The proof is provided in Appendix A.2 (p_90-p_92), showing the decomposition g(θ; r) = g*(θ) + δ(θ; r) and the variance relationship.
Confirmed (90%) Theorem 3 (Impossibility Trilemma). In any distribution-matching framework where teacher and student share parameters, the following three properties cannot hold simultaneously: (a) Objective stability, (b) Sustained improvement, (c) Leakage-free training.
The theorem is stated in p_115 and the analysis is provided in Appendix A.4 (p_102-p_115), examining two strategies (frozen teacher and evolving teacher) and showing neither can satisfy all three properties simultaneously under shared parameters.
Confirmed (95%) We prove that this asymmetry renders the OPSD objective ill-posed: it contains an irreducible mutual information gap I(Y_t; R | X, Y_{<t}) > 0 that the student can never eliminate, regardless of its capacity (Theorem 1).
This claim directly follows from Theorem 1, which proves the KL decomposition identity showing the irreducible mutual information gap. The theorem and proof are provided, establishing that I(Y_t; R | X, Y_{<t}) > 0 is independent of θ and cannot be e
Confirmed (85%) RLSD achieves the highest average accuracy, outperforming the Base LLM by 4.69% and GRPO by 2.32% on average under the 4K setting.
The paper reports specific quantitative results in Table 2 (referenced in p_73): RLSD achieves highest average accuracy, outperforming Base LLM by 4.69% and GRPO by 2.32% under the 4K setting. These are concrete numerical results.
Confirmed (85%) RLSD benefits from dense token-level credit assignment, yielding notable gains on challenging mathematical datasets such as MathVista (+1.9%) and MathVision (+3.91%).
The paper reports specific numerical gains: MathVista (+1.9%) and MathVision (+3.91%) in Table 2. These are concrete quantitative results comparing RLSD to GRPO.
Confirmed (85%) RLSD also outperforms the additive fusion method GRPO+OPSD by a margin of 3.27 points.
The paper reports a specific quantitative margin: RLSD outperforms GRPO+OPSD by 3.27 points in average accuracy. This is a concrete numerical result from Table 2.
Confirmed (80%) RLSD at 200 steps already surpasses GRPO trained for twice as many steps, demonstrating faster convergence.
This finding is shown in Figure 5 and described in p_1 and p_74. The figure shows training dynamics over 200 steps, demonstrating that RLSD at 200 steps surpasses GRPO at 400 steps. However, exact numerical values at these checkpoints are not explici
Confirmed (85%) OPSD reaches its peak performance early and degrades, whereas RLSD inherits the training stability of GRPO while achieving a higher convergence ceiling.
This finding is shown in Figure 5(a) and described in p_1 and p_74. The figure clearly shows OPSD's early peak followed by degradation, while RLSD maintains stability with higher convergence ceiling.
Confirmed (85%) Figure 3(a) tracks the frequency of privileged information leakage over 100 training steps and reveals a monotonically increasing trend: the model becomes progressively more reliant on information it cannot access at test time.
Figure 3(a) is explicitly referenced showing the monotonically increasing trend of privileged information leakage over 100 training steps. The figure provides quantitative visualization of this trend.
Confirmed (85%) Figure 3(b) reports the corresponding validation accuracy, which peaks within the first 10-20 steps and subsequently declines, consistent with the escalating leakage.
Figure 3(b) is explicitly referenced showing validation accuracy peaking within first 10-20 steps then declining. The figure provides quantitative visualization of this pattern.
Confirmed (85%) Under OPD, the KL divergence decreases steadily throughout training, reflecting genuine convergence. Under OPSD, the divergence drops briefly in the first few steps but then plateaus at a level comparable to its initial value, exhibiting no sustained reduction.
Figure 3(c) is explicitly referenced showing the KL divergence patterns: OPD shows steady decrease while OPSD plateaus after initial drop. The figure provides quantitative visualization.
Confirmed (80%) Figure 5(c) confirms that the clipped credit assignment mechanism is actively engaged, with clip ratios stabilizing around 3%-6%, successfully bounding the teacher's per-token influence.
Figure 5(c) is explicitly referenced showing clip ratios stabilizing around 3%-6%. This provides quantitative evidence that the clipping mechanism is actively engaged.
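As a rough illustration of the quantity plotted in Figure 5(c), the sketch below computes a clip ratio as the fraction of tokens whose unclipped weight falls outside [1 - ε_w, 1 + ε_w]. This definition and the function name are assumptions; the summary does not state how the paper measures it.

```python
def clip_ratio(raw_weights, eps_w=0.2):
    """Fraction of tokens whose raw weight would be clipped (assumed definition)."""
    if not raw_weights:
        return 0.0
    low, high = 1.0 - eps_w, 1.0 + eps_w
    clipped = sum(1 for w in raw_weights if w < low or w > high)
    return clipped / len(raw_weights)
```

Under this definition, a value of 3%-6% would mean that the bound binds on a small minority of tokens while most weights pass through unchanged.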
Confirmed (80%) GRPO suffers from rapid entropy collapse due to its uniform sequence-level reward, whereas RLSD maintains a consistently higher entropy level by selectively strengthening critical reasoning tokens without uniformly suppressing alternatives at every position.
Figure 5(b) is explicitly referenced showing entropy patterns: GRPO shows rapid collapse while RLSD maintains higher entropy. The figure provides quantitative visualization.
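For readers unfamiliar with the entropy curves in Figure 5(b), the following generic sketch shows how mean per-token policy entropy is commonly computed from next-token logits. It is a standard calculation, not code from the paper, and the function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)

def mean_entropy(sequence_logits):
    """Average entropy across all positions of one generated sequence."""
    return sum(token_entropy(l) for l in sequence_logits) / len(sequence_logits)
```

A collapse shows up as this average drifting toward zero, meaning the policy places nearly all probability on a single token at most positions.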
Confirmed (85%) All three variants confirm the prediction: leakage increases in every case.
Figure 3(a) shows leakage counts for all three OPSD variants (Full OPSD, Teacher's Top-1, Student's Top-1), all exhibiting increasing leakage over training. This is quantitative evidence from controlled experiments.
Confirmed (85%) We train our models on MMFineReason-123K, a challenging subset derived from the MMFineReason-1.8M corpus via difficulty-based filtering.
The design choice is clearly specified with justification: MMFineReason-123K is derived via difficulty-based filtering where samples failing all 4 rollouts of Qwen3-VL-4B-Thinking are retained. The paper explains this concentrates training signal on
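The filtering rule described above (keep a sample only if all 4 rollouts fail) can be sketched as follows. The data layout, field names, and function name are hypothetical; the summary specifies only the retention criterion.

```python
def keep_hard_samples(samples, rollout_correct, n_rollouts=4):
    """Retain samples on which every rollout was graded incorrect.

    samples: list of dicts with an "id" key (hypothetical layout)
    rollout_correct: maps sample id -> list of booleans, one per rollout
    """
    kept = []
    for sample in samples:
        results = rollout_correct[sample["id"]]
        # Require the full set of rollouts, and require that none succeeded.
        if len(results) == n_rollouts and not any(results):
            kept.append(sample)
    return kept
```

For example, a sample with results `[False, False, False, False]` is kept, while one with a single `True` is dropped as too easy.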
Confirmed (80%) For GRPO, GRPO+OPSD, and RLSD, the learning rate is fixed at 1 × 10^-6; for OPSD and SDPO, the learning rate is set to 1 × 10^-5 following their original implementations.
The learning rates are clearly specified with justification: OPSD and SDPO use 1e-5 'following their original implementations', while GRPO-based methods use 1e-6. This is justified by prior work.
Insufficient evidence (60%) The training batch size is set to 256, and for each prompt, we sample 8 rollouts with a sampling temperature of 1.0.
The hyperparameters (batch size 256, 8 rollouts, temperature 1.0) are stated but no justification is provided for these specific values. No ablation or principled argument is given for these choices.
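The reported training settings can be collected into one configuration sketch. The numeric values come from this summary; the class and function names are illustrative, the linear shape of the λ decay is an assumption (the summary says only that λ goes from 0.5 to 0 over 50 steps), and the role λ plays inside the RLSD objective is not specified here.

```python
from dataclasses import dataclass

# Learning rates as reported: GRPO-based methods use 1e-6, OPSD and SDPO use 1e-5.
LEARNING_RATES = {
    "GRPO": 1e-6,
    "GRPO+OPSD": 1e-6,
    "RLSD": 1e-6,
    "OPSD": 1e-5,
    "SDPO": 1e-5,
}

@dataclass(frozen=True)
class TrainConfig:
    method: str
    batch_size: int = 256
    rollouts_per_prompt: int = 8
    temperature: float = 1.0
    eps_w: float = 0.2            # RLSD clip bound
    teacher_sync_every: int = 10  # RLSD teacher refresh interval, in steps

    @property
    def learning_rate(self):
        return LEARNING_RATES[self.method]

def lambda_at(step, lam0=0.5, decay_steps=50):
    """λ schedule: 0.5 at step 0, reaching 0 at step 50 (linear decay assumed)."""
    return lam0 * max(0.0, 1.0 - step / decay_steps)

def should_sync_teacher(step, every=10):
    """True on steps where the teacher copy would be refreshed from the student."""
    return step > 0 and step % every == 0
```

Looking up an unlisted method raises a `KeyError`, which avoids silently assigning a learning rate the summary never reported.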

... 40 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing reproduction details

Limitations (as stated by the authors)

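This analysis was generated automatically by PDF Reading Assistant and is provided for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.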

Analysis time: 2026-04-10T01:26:31+00:00 · Data source: Paper Collector