This paper identifies two conditions governing On-Policy Distillation effectiveness: thinking-pattern consistency and new knowledge provision. Successful distillation is driven by progressive alignment on shared high-probability tokens carrying 97%-99% of probability mass, explaining why stronger t…
Core Question
What conditions govern the effectiveness of On-Policy Distillation (OPD) for large language models, and why do stronger teachers not always yield better distillation outcomes?
Core Method
Approach: The authors conduct controlled experiments across multiple model families (Qwen, DeepSeek, R1-Distill), defining dynamic metrics (Overlap Ratio, Overlap-Token Advantage, Entropy Gap) to monitor training dynamics. They perform ablation studies, including reverse-distillation experiments and targeted token-region optimization, to isolate the mechanisms driving OPD success or failure.
Key components:
- The high-probability-token alignment phenomenon generalizes across different model pairs.
- Successful OPD consistently coincides with the emergence of high-probability-token alignment at student-visited states.
- Failed OPD shows poor or unstable alignment metrics across different teacher-student configurations.
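To make the monitoring metrics named in the approach more concrete, here is a minimal sketch of how an overlap ratio, the probability mass on shared tokens, and an entropy gap could be computed at a single decoding position. The paper's exact definitions are not reproduced in this analysis, so everything here is an assumption: the top-k set construction, the sign convention of the entropy gap (teacher minus student), and all function names are hypothetical. Overlap-Token Advantage is not attempted because its definition is not available here.

```python
import math


def top_k_ids(probs, k):
    """Indices of the k highest-probability tokens (ties broken by lower index)."""
    return set(sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k])


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def overlap_stats(student_probs, teacher_probs, k):
    """Assumed per-position metrics; not the paper's official definitions."""
    s_top = top_k_ids(student_probs, k)
    t_top = top_k_ids(teacher_probs, k)
    shared = s_top & t_top
    return {
        # Fraction of the top-k slots that both models agree on.
        "overlap_ratio": len(shared) / k,
        # Probability mass each model places on the shared tokens.
        "shared_student_mass": sum(student_probs[i] for i in shared),
        "shared_teacher_mass": sum(teacher_probs[i] for i in shared),
        # Assumed sign convention: teacher entropy minus student entropy.
        "entropy_gap": entropy(teacher_probs) - entropy(student_probs),
    }


# Toy next-token distributions over a 5-token vocabulary (illustrative only).
student = [0.70, 0.20, 0.05, 0.03, 0.02]
teacher = [0.60, 0.05, 0.30, 0.03, 0.02]
stats = overlap_stats(student, teacher, k=2)
print(stats)
```

In this toy case the two models share only one of their top-2 tokens, so the overlap ratio is 0.5. In practice such statistics would presumably be averaged over positions in student-generated rollouts, since the analysis describes alignment "at student-visited states".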
Claim Verification
The paper provides substantial experimental evidence across Sections 3.1, 3.2, and the reverse-distillation experiments to validate both conditions. Multiple controlled comparisons demonstrate that thinking-pattern consistency (higher initial overlap
The paper provides clear mathematical definitions for all three metrics (Overlap Ratio in p16-17, Overlap-Token Advantage in p18-19, Entropy Gap in p20-21) and uses them consistently throughout the experimental analysis. This is a methodological cont
The claim is supported by quantitative experimental results shown in Figure 2, with the paper explicitly stating that distillation from GRPO teacher 'consistently outperforms' the Non-thinking teacher. The paper also notes both teachers have 'broadly
The paper provides quantitative evidence from Figure 2 (right) showing the GRPO teacher has higher initial overlap ratio. This is a direct experimental observation.
The claim is supported by quantitative results in Figure 4, with specific gap recovery rates reported. The paper shows post-trained teachers recover 'a much larger fraction of the teacher-student gap' compared to same-pipeline teachers.
The paper provides quantitative evidence from Figure 5 showing that distilling JustRL-1.5B toward its pre-RL checkpoint causes regression to pre-RL performance. This is a specific experimental result.
The paper provides experimental evidence from Figure 5 showing that using R1-Distill-7B as teacher produces no improvement and causes regression, despite it being a larger and slightly stronger model.
The paper provides specific quantitative results: 'recovering more than 80% of the performance gap' for JustRL-1.5B teacher vs 'fails to yield any improvement' for R1-Distill-7B teacher, with reference to Figure 6.
The paper provides detailed observations of training dynamics from Figure 6 (bottom), describing specific metric behaviors in successful vs failing runs. These are quantitative observations from tracked metrics.
The paper provides specific quantitative measurements: '97%-99% of the total probability mass' with reference to Appendix B.1 and Figure 18. This is a direct measurement from the experiments.
The paper provides ablation study results from Figure 7 showing that Overlap Top-k achieves nearly the same performance as Student Top-k across all three benchmarks, while Non-Overlap Top-k is consistently weaker.
The paper provides specific quantitative values from Figure 7: overlap ratio rises from ~72% to above 91% for Student Top-k and Overlap Top-k, while Non-Overlap Top-k shows different behavior.
The paper presents two strategies in Section 5 with experimental validation: off-policy cold start (Section 5.1, Figure 8) and teacher-aligned prompt selection (Section 5.2, Figures 9-10). Both strategies are shown to improve OPD outcomes in otherwis
The paper provides quantitative results from Figure 8 showing the two-stage SFT+OPD approach consistently outperforms pure OPD starting from base model.
The paper provides observations of overlap dynamics from Figure 8, describing specific differences in trajectory stability between SFT-initialized and base-initialized students.
The paper provides quantitative results from Figure 9 showing teacher-aligned template improves validation performance on all three benchmarks.
The paper observes that teacher-aligned prompts lead to 'substantially lower student entropy' (quantitative observation from experiments). However, the 'suggestion' that this may not be ideal is an interpretation rather than a rigorously proven findi
The paper provides specific quantitative measurements from Figure 11(b): teacher's accuracy advantage decreases from +0.37 at 1K prefix to +0.02 at 16K prefix.
The paper provides observations from Figure 13 showing the back-to-front pattern of entropy propagation during training.
The paper provides specific quantitative AUROC values from Figure 14: 0.73 for JustRL-1.5B and 0.75 for R1-Distill-7B, showing both teachers produce globally informative reward signals.
... 39 claims in total
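Several of the claims above are stated as a recovered fraction of the teacher-student performance gap (for example, "recovering more than 80% of the performance gap"). The analysis does not give the formula, so the sketch below assumes the common normalization: student improvement divided by the initial teacher-student gap. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def gap_recovery(student_before, student_after, teacher):
    """Fraction of the initial teacher-student gap closed by distillation.

    Assumed definition: (student_after - student_before) / (teacher - student_before).
    A negative value indicates the student regressed below its starting point.
    """
    gap = teacher - student_before
    if gap <= 0:
        raise ValueError("teacher must outperform the initial student")
    return (student_after - student_before) / gap


# Illustrative benchmark scores only, not values from the paper.
recovered = gap_recovery(student_before=40.0, student_after=56.0, teacher=60.0)
regressed = gap_recovery(student_before=40.0, student_after=38.0, teacher=60.0)
print(f"recovered: {recovered:.2f}, regressed: {regressed:.2f}")
```

Under this assumed definition, a run that "fails to yield any improvement" would score at or below zero, which makes successful and failed teacher-student configurations directly comparable on one scale.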
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available - implementation details are not accessible
- No data available - DAPO-Math-17K dataset and prompt templates not provided
- Training hyperparameters not specified (learning rate, batch size, epochs, optimizer, etc.)
- Distillation-specific parameters missing (temperature, loss function formulation, KL divergence weight)
- Random seeds not reported for reproducibility
- Hardware/environment specifications not mentioned (GPU type, memory, framework versions)
- Complete prompt templates referenced but not shown in the provided sections
- Section 4.1 referenced for training/evaluation setup but not provided in the excerpt
- Model architecture details beyond names not specified
- Evaluation metrics implementation details not provided
Limitations (as stated by the authors)
- Dense reward is effective on moderately long reasoning traces, but its reliability degrades with depth as the student prefix drifts further from the states familiar to the teacher. This suggests that OPD may not extend cleanly to longer-horizon settings such as extended chain-of-thought or agentic multi-turn interaction.
- We have not directly verified this anisotropy hypothesis, and doing so would require analyzing the directional structure of per-token gradients, which we leave to future work.
- These findings reveal a fundamental tension between supervision density and supervision reliability, and point to the limitations of current OPD for long-horizon reasoning and agentic settings.
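This analysis was generated automatically by the PDF Reading Assistant and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may contain inaccuracies. For the original paper, see arXiv.
Analysis time: 2026-04-20T13:08:56+00:00 · Data source: Paper Collector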