Identifies why on-policy self-distillation (OPSD) fails in RLVR: an irreducible information gap between teacher and student causes privileged-information leakage. Proposes RLSD, which decouples update direction from update magnitude. RLSD improves average accuracy by 4.69% over the base model, with faster convergence and more stable training.
Core Question
Why does on-policy self-distillation (OPSD) fail in reinforcement learning with verifiable rewards (RLVR), and how can self-distillation be reformulated to provide token-level supervision without leaking privileged information?
Core Method
Approach: The authors provide a theoretical analysis proving that OPSD's irreducible mutual information gap causes information leakage, then propose RLSD, which repurposes the teacher-student discrepancy as a credit assignment signal within GRPO's policy gradient framework. The method uses the environment reward for the update direction and the teacher's evidence ratio for magnitude modulation. It is evaluated on Qwen3-VL-8B-Instruct across five multimodal reasoning benchmarks.
Key components:
- OPSD-trained models systematically reference privileged information unavailable at inference time, such as invisible 'reference solutions'.
- The OPD teacher is defined as an external model, while the OPSD teacher is the same model conditioned on privileged information r.
- Both methods share the training objective of minimizing the per-token divergence D between teacher and student distributions.
- Gradients are backpropagated only through the student distribution P_S, with the teacher P_T serving as a fixed target.
- The base model is Qwen3-VL-8B-Instruct, with comparisons against GRPO, OPSD, SDPO, and GRPO+OPSD baselines.
- Training uses batch size 256, 8 rollouts per prompt, temperature 1.0, and learning rate 1×10^-6 for GRPO-based methods.
- RLSD hyperparameters: λ initialized at 0.5 and decayed to 0 over 50 steps, ε_w = 0.2, teacher synced every 10 steps.
- RLSD requires only ground-truth answers as privileged information, which is less demanding than OPSD's verified reasoning traces.
Section IDs: sec_4, sec_17
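The direction/magnitude split described above can be illustrated with a minimal sketch. This is not the paper's Eqs. 10-15, which this report does not reproduce. It assumes that the "evidence ratio" is the per-token teacher-to-student probability ratio, that the weight is an exponential of the λ-scaled log ratio clipped to [1 - ε_w, 1 + ε_w], and that λ decays linearly. All function names (`group_advantages`, `lambda_schedule`, `should_sync_teacher`, `token_weights`, `token_advantages`) are hypothetical.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def lambda_schedule(step, lam0=0.5, decay_steps=50):
    """Decay lambda from lam0 to 0 over decay_steps (linear shape is an assumption)."""
    return lam0 * max(0.0, 1.0 - step / decay_steps)

def should_sync_teacher(step, every=10):
    """True on steps where the teacher copy would be refreshed from the student."""
    return step > 0 and step % every == 0

def token_weights(teacher_logp, student_logp, advantage, lam, eps_w=0.2):
    """Hypothetical per-token magnitude weights, always positive and clipped.

    gain > 0 means the privileged teacher finds the token more likely than
    the student does. For positive advantages such tokens are up-weighted;
    for negative advantages the relationship is flipped.
    """
    gain = np.asarray(teacher_logp, dtype=float) - np.asarray(student_logp, dtype=float)
    direction = 1.0 if advantage >= 0 else -1.0
    raw = np.exp(lam * direction * gain)
    return np.clip(raw, 1.0 - eps_w, 1.0 + eps_w)

def token_advantages(teacher_logp, student_logp, advantage, step):
    """Sign comes from the environment reward; only the magnitude is modulated."""
    lam = lambda_schedule(step)
    return advantage * token_weights(teacher_logp, student_logp, advantage, lam)

if __name__ == "__main__":
    adv = group_advantages([1.0, 0.0, 0.0, 1.0])  # two correct, two incorrect rollouts
    t_logp = [-0.1, -2.0, -0.5]  # made-up teacher log-probs for three tokens
    s_logp = [-0.7, -0.4, -0.5]  # made-up student log-probs for the same tokens
    print(token_advantages(t_logp, s_logp, adv[0], step=0))
    print(token_advantages(t_logp, s_logp, adv[1], step=0))
    print(token_advantages(t_logp, s_logp, adv[0], step=50))
```

Because the weights stay strictly positive, every token's advantage keeps the sign of the rollout-level advantage, so the teacher can only rescale an update and never reverse it. Once λ reaches 0 the weights are all 1 and the sketch reduces to plain GRPO advantages.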
Claim Verification
The paper provides a complete specification of the RLSD method in Section 4, including the algorithmic formulation (Eqs. 10-15), the three-step procedure (privileged information gain, direction-aware evidence reweighting, clipped credit assignment),
The paper provides both controlled experiments (Figure 3 showing leakage frequency, validation accuracy, and KL divergence patterns) and formal analysis (Theorem 1 proving the irreducible mutual information gap, Proposition 1 characterizing gradient
The proposition is formally stated with properties (i) and (ii). The proof is provided in Appendix A.2 (p_90-p_92), showing the decomposition g(θ; r) = g*(θ) + δ(θ; r) and the variance relationship.
The theorem is stated in p_115 and the analysis is provided in Appendix A.4 (p_102-p_115), examining two strategies (frozen teacher and evolving teacher) and showing neither can satisfy all three properties simultaneously under shared parameters.
This claim directly follows from Theorem 1, which proves the KL decomposition identity showing the irreducible mutual information gap. The theorem and proof are provided, establishing that I(Y_t; R | X, Y_{
The paper reports specific quantitative results in Table 2 (referenced in p_73): RLSD achieves highest average accuracy, outperforming Base LLM by 4.69% and GRPO by 2.32% under the 4K setting. These are concrete numerical results.
The paper reports specific numerical gains: MathVista (+1.9%) and MathVision (+3.91%) in Table 2. These are concrete quantitative results comparing RLSD to GRPO.
The paper reports a specific quantitative margin: RLSD outperforms GRPO+OPSD by 3.27 points in average accuracy. This is a concrete numerical result from Table 2.
This finding is shown in Figure 5 and described in p_1 and p_74. The figure shows training dynamics over 200 steps, demonstrating that RLSD at 200 steps surpasses GRPO at 400 steps. However, exact numerical values at these checkpoints are not explici
This finding is shown in Figure 5(a) and described in p_1 and p_74. The figure clearly shows OPSD's early peak followed by degradation, while RLSD maintains stability with higher convergence ceiling.
Figure 3(a) is explicitly referenced showing the monotonically increasing trend of privileged information leakage over 100 training steps. The figure provides quantitative visualization of this trend.
Figure 3(b) is explicitly referenced showing validation accuracy peaking within first 10-20 steps then declining. The figure provides quantitative visualization of this pattern.
Figure 3(c) is explicitly referenced showing the KL divergence patterns: OPD shows steady decrease while OPSD plateaus after initial drop. The figure provides quantitative visualization.
Figure 5(c) is explicitly referenced showing clip ratios stabilizing around 3%-6%. This provides quantitative evidence that the clipping mechanism is actively engaged.
Figure 5(b) is explicitly referenced showing entropy patterns: GRPO shows rapid collapse while RLSD maintains higher entropy. The figure provides quantitative visualization.
Figure 3(a) shows leakage counts for all three OPSD variants (Full OPSD, Teacher's Top-1, Student's Top-1), all exhibiting increasing leakage over training. This is quantitative evidence from controlled experiments.
The design choice is clearly specified with justification: MMFineReason-123K is derived via difficulty-based filtering where samples failing all 4 rollouts of Qwen3-VL-4B-Thinking are retained. The paper explains this concentrates training signal on
The learning rates are clearly specified with justification: OPSD and SDPO use 1e-5 'following their original implementations', while GRPO-based methods use 1e-6. This is justified by prior work.
The hyperparameters (batch size 256, 8 rollouts, temperature 1.0) are stated but no justification is provided for these specific values. No ablation or principled argument is given for these choices.
... 40 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproducibility Details
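For reference, the KL decomposition mentioned in the Theorem 1 entries above has the following generic form. This is the standard information-theoretic identity, written in assumed notation: c = (x, y_{<t}) is the context visible to the student, r is the privileged information, and \bar{P}_T is the teacher distribution averaged over r. The paper's exact statement may differ.

```latex
\mathbb{E}_{r \sim p(r \mid c)}
  \Big[ \mathrm{KL}\big( P_T(\cdot \mid c, r) \,\big\|\, P_S(\cdot \mid c) \big) \Big]
  = I(Y_t; R \mid c)
  + \mathrm{KL}\big( \bar{P}_T(\cdot \mid c) \,\big\|\, P_S(\cdot \mid c) \big),
\qquad
\bar{P}_T(y \mid c) = \mathbb{E}_{r \sim p(r \mid c)}\big[ P_T(y \mid c, r) \big].
```

The first term on the right does not depend on the student's parameters, so a student that never sees r can reduce only the second term. Read this way, the mutual information term is the gap that the report calls irreducible.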
- No code repository available - implementation cannot be verified or reused
- No data access provided - MMFineReason-123K dataset not publicly linked
- Training duration not specified - number of training steps or epochs missing
- Random seeds not reported - critical for reproducibility in RL training
- Software environment versions not specified (PyTorch, CUDA, VERL/EasyR1 versions)
- Data preprocessing and formatting details not fully documented
- Evaluation protocol details missing (exact prompts, few-shot settings)
- Dataset splits for training/validation not specified
- Ambiguous availability statement 'available at inference' - unclear what this means
Limitations (as stated by the authors)
- This paper focuses primarily on the theoretical analysis of OPSD's structural limitations and the motivation and validation of the RLSD paradigm. To enable faster release, this version provides limited experiments, validating around multimodal reasoning scenarios.
- However, we have preliminarily validated RLSD across a broader range of settings, including pure text reasoning, video understanding, and additional model families beyond the Qwen series, and observed consistent gains. We will include these in the forthcoming version.
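This analysis was generated automatically by PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-10T01:26:31+00:00 · Data source: Paper Collector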