SDAR reconciles RL and on-policy self-distillation for multi-turn LLM agents via adaptive token-level gating, achieving +9.4% on ALFWorld, +7.0% on Search-QA, and +10.2% on WebShop over GRPO.
Core Question
How can reinforcement learning be effectively combined with on-policy self-distillation for multi-turn LLM agent training, given that OPSD suffers from compounding errors and unreliable token-level supervision in multi-turn settings?
Core Method
Approach: SDAR combines the GRPO policy loss with an auxiliary OPSD objective modulated by adaptive token-level gates, which apply sigmoid functions to teacher-student divergence or entropy signals. The gate is detached via stop-gradient to ensure stable training, and three gating strategies (entropy, gap, soft-OR) are evaluated. Experiments are conducted on ALFWorld, WebShop, and Search-QA using Qwen2.5-Instruct and Qwen3-Instruct models trained for 150 steps on 8 H800 GPUs.
Key components:
- Recent hybrid methods combine RL with distillation but suffer from rigid scheduling or unstable updates.
- The proposed method treats distillation as a separate auxiliary objective with adaptive token-level gating.
- The approach preserves the unbiasedness of the RL advantage.
- Only beneficial teacher signals are selectively injected through the gating mechanism.
Section IDs: sec_2, sec_14
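The gating idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation. The exact gate formulas (Equations 27-33) are not reproduced in this summary, so the following are all assumptions: the sign convention (the gap gate opens when the teacher assigns a higher log-probability than the student), the direction of the entropy gate, the threshold `tau`, the soft-OR form `1 - (1 - a)(1 - b)`, and every function name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gap_gate(teacher_logp, student_logp, beta, tau=0.0):
    # Assumed convention: gap = teacher log-prob minus student log-prob,
    # so the gate opens where the teacher is more confident in the token.
    gap = teacher_logp - student_logp
    return sigmoid(beta * (gap - tau))


def entropy_gate(entropy, beta, tau):
    # Assumed direction: open the gate on high-entropy (uncertain) tokens.
    return sigmoid(beta * (entropy - tau))


def soft_or(g_a, g_b):
    # Assumed soft-OR: probabilistic OR of two gates in [0, 1].
    return 1.0 - (1.0 - g_a) * (1.0 - g_b)


def sdar_total_loss(grpo_loss, distill_losses, gates, lam):
    # The gates are plain floats here, i.e. constants. In an autograd
    # framework they would be detached (stop-gradient) so that no gradient
    # flows through the gate itself.
    aux = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return grpo_loss + lam * aux
```

With these definitions, a token on which teacher and student agree exactly receives a gate of 0.5, and the auxiliary term stays a separate, weighted addition to the GRPO loss rather than a modification of the advantage.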
Claim Verification
The paper fully specifies the SDAR method with sigmoid gate mechanism (Equations 16-17, 27-33, Algorithm 1), provides theoretical analysis (Propositions 1-5 in Appendix A), and validates it experimentally across three benchmarks with three model scal
The token-level gate is mathematically defined (g_t ∈ [0,1] in Equations 27-28), its application to the sampled-token surrogate is specified (Equation 33), and the paper shows how different gating strategies (entropy, gap, soft-OR) share the same opti
All three gating strategies are explicitly defined with mathematical formulas in p_30-p_32, and Figure 6 provides comparative evaluation showing gap gating achieves ~0.84 asymptotic success rate, outperforming the alternatives.
The paper explicitly shows L_SDAR as an auxiliary term added to L_GRPO (Equations 16-17), and Propositions 1-5 in Appendix A provide theoretical analysis of gradient properties, confirming that the RL advantage semantics are preserved.
The sigmoid gate (Equations 27-28) provides adaptive, smooth modulation. The paper contrasts this with the rigid schedules in TCOD, Skill-SD, and HDPO (p_2, p_6). Figure 5 shows the adaptive behavior of the gate during training.
The gate is explicitly defined using token-level signals: entropy h_t and teacher-student gap Δ_t (p_28-p_32). The philosophy of token-level self-regulation is stated and implemented through the sigmoid gate.
Specific quantitative improvements are stated in p_7: +9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc for 7B model. These are concrete numerical results comparing SDAR to GRPO baseline.
The paper states SDAR 'entirely avoids the catastrophic instability' and references Figure 2 (Left) showing GRPO+OPSD degradation. However, without visible training curves or quantitative stability metrics (variance, failure rates), the claim of 'ent
Specific comparative numbers are provided: Qwen2.5-3B achieves 84.4 vs 73.4 (Skill-SD) and 79.7 (RLSD) on ALFWorld (p_42); Qwen3-1.7B achieves 53.9% vs 42.2% (RLSD) (p_42). Quantitative evidence across model scales.
The claim references Figure 3 showing negative-gap tokens exceed 50%, but the actual figure data isn't visible in the provided text. The specific percentage isn't given, only 'exceeds 50%'. Without the exact figure or more precise quantification, evi
Specific quantitative evidence in p_44: Random Retrieval yields gains of +1.9/+1.6/+1.0 on ALFWorld/WebShop-Score/WebShop-Acc over GRPO baseline. Concrete numbers support the claim.
Specific numbers provided: Skill-GRPO drops from 80.5 to 60.2 on ALFWorld-3B when tested without skills (p_41). SDAR's inference without skills is stated as a design property.
Specific numbers provided: SDAR achieves 84.4 on ALFWorld-3B and 53.9 on ALFWorld-1.7B, surpassing Skill-GRPO* (p_41). Quantitative evidence supports the claim.
Specific numbers for ALFWorld: 84.4 vs 73.4 (Skill-SD) and 79.7 (RLSD) on Qwen2.5-3B (p_42). WebShop comparison is mentioned but without specific numbers, slightly weakening the claim.
Specific numbers for Qwen3-1.7B: Skill-GRPO 21.1%, GRPO 46.1%, RLSD 42.2%, SDAR 53.9% on ALFWorld (p_42). Complete quantitative comparison provided.
The claim references Figure 5(a) for evidence that the mean Teacher-Student gap remains consistently negative. Without the actual figure data or numerical values showing the gap trajectory, the claim cannot be fully verified from the text alone.
The claim about Δ converging toward zero references Figure 5(a), which isn't visible. No numerical convergence data is provided in the text. The claim describes figure behavior without supporting numbers.
The claim about gate activation ratio being below 0.5 references Figure 5(b), which isn't visible. No specific numerical data about the ratio trajectory is provided in the text.
Specific numbers provided: Random Retrieval gains of +1.9/+1.6/+1.0 on ALFWorld/WebShop-Score/WebShop-Acc (p_44). The claim that all four strategies outperform GRPO is stated with reference to Table 2.
Specific numbers provided: Keyword Matching achieves +4.7/+8.5/+10.2 gains and surpasses UCB on WebShop (p_44). Quantitative evidence supports the claim.
... 41 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available - implementation details not accessible
- No training/evaluation datasets provided or specified
- Missing all hyperparameters: learning rate, batch size, number of epochs/training steps, optimizer settings
- No model architecture details (base model, size, number of parameters)
- No specification of RL algorithm used (PPO, A3C, etc.)
- Missing details on token-level gating mechanism implementation (how is it adaptive and bounded?)
- No information about privileged context c+ construction and what auxiliary information is used
- No random seeds specified for reproducibility
- No hardware/computational specifications provided
- No evaluation environments/tasks specified
Limitations (author-stated)
- λ = 0.001 exerts insufficient corrective pressure to meaningfully aid the RL process, confirming the necessity of a carefully calibrated, moderate coefficient.
- An excessively small β (including the no-gate baseline) applies distillation indiscriminately, thereby inheriting the multi-turn instability of naïve OPSD.
- An overly large β strictly binarizes the gate, stripping away the smooth modulation necessary for assigning partial credit on borderline tokens.
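The effect of β described in the last two items can be seen with a toy calculation. The sketch below assumes a gate of the form sigmoid(β · gap); the paper's exact formula is not reproduced in this summary. The gap values and the three β settings are made up for illustration, and `gate_profile` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate_profile(gaps, beta):
    # Gate value per token for an assumed gate sigmoid(beta * gap).
    return [round(sigmoid(beta * g), 3) for g in gaps]


# Hypothetical teacher-student gaps: two negative, two positive,
# with the middle two close to the decision boundary.
gaps = [-1.0, -0.1, 0.1, 1.0]

near_zero = gate_profile(gaps, 0.0)    # every token gets the same weight
moderate = gate_profile(gaps, 2.0)     # smooth, graded weights
sharp = gate_profile(gaps, 100.0)      # effectively a hard 0/1 mask
```

At β = 0 every token receives the same weight, so distillation is applied indiscriminately. At β = 100 the borderline tokens at ±0.1 are pushed to 0 and 1, losing the partial credit that a moderate β assigns them.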
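This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-05-16T07:27:29+00:00 · Data source: Paper Collector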