DMax enables aggressive parallel decoding in diffusion language models through On-Policy Uniform Training and Soft Parallel Decoding, addressing error accumulation. It improves tokens per forward from 2.8 to 6.2 while preserving accuracy, achieving over 1000 tokens/second throughput.
Core Question
How can diffusion language models achieve aggressive parallel decoding without suffering from error accumulation that degrades accuracy?
Core Method
Approach: The paper proposes DMax with two components: On-Policy Uniform Training (OPUT) extends pretrained masked diffusion models into self-corrective uniform diffusion models by sampling noisy sequences on-policy from the model's predictive distribution. Soft Parallel Decoding (SPD) represents intermediate decoding states as hybrid soft embeddings interpolated between predicted token and mask embeddings based on prediction confidence.
Key components:
- MDLMs train models to recover clean tokens from masked positions through a discrete denoising process.
- UDLMs train models to denoise from arbitrary vocabulary tokens, enabling self-correction at all positions.
- MDLMs suffer from error accumulation because decoded tokens become fixed context that cannot be revised.
- The proposed approach combines stable masked initialization with UDLM's self-revising capability.
Section IDs: sec_5, sec_6
Claim Verification
The paper provides detailed description of DMax's architecture (p_9, p_33-p_50), mathematical formulation of hybrid embeddings (Eqs. 7-10), and experimental validation demonstrating the self-revising capability through iterative refinement with soft
OPUT is described in detail (p_23-p_30) with mathematical formulation (Eqs. 4-6), training procedure, and experimental validation. The dual supervision on both masked and predicted sequences demonstrates preservation of mask denoising while adding se
SPD is fully specified with mathematical formulation (Eqs. 7-10 in p_37-p_40), decoding algorithm (p_34-p_48), and experimental validation. The hybrid embedding interpolation between predicted token and mask embedding based on confidence is clearly d
Specific TPF numbers (2.04 to 5.48) are provided in p_10 and referenced to Table 1. The claim is directly stated with quantitative evidence. Minor gap: 'minimal accuracy degradation' is qualitative without specific accuracy numbers for this compariso
Specific TPF numbers (2.71 to 5.86) are provided in p_10 for MBPP benchmark. The claim is directly stated with quantitative evidence. Similar to claim 4, 'comparable performance' is qualitative but the main TPF improvement is quantified.
The unification approach is clearly described in p_18, explaining how masked initialization preserves stability while token-to-token denoising enables self-correction. The approach is validated through experiments comparing to both MDLM and UDLM base
OPUT's core mechanism is clearly specified in p_23-p_30 with mathematical formulation. The on-policy sampling from model's own distribution (Eq. 4) and dual supervision (Eq. 6) are well-described. Experimental validation shows OPUT outperforms conven
This is a clearly specified design choice in p_24 with mathematical notation. The corruption level sampling and masked sequence construction procedure are fully described.
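The note above mentions corruption-level sampling followed by masked sequence construction. A minimal sketch of one common way to do this follows. It assumes a uniform corruption level and independent per-position masking, which is a standard masked-diffusion recipe and not necessarily the paper's exact procedure; `MASK_ID` and `build_masked_sequence` are hypothetical names.

```python
import random

MASK_ID = -1  # hypothetical mask token id, chosen only for illustration


def build_masked_sequence(tokens, rng, mask_id=MASK_ID):
    """Sample a corruption level t, then mask each position with probability t.

    Assumption: t is uniform on [0, 1) and positions are masked independently.
    """
    t = rng.random()
    masked = [mask_id if rng.random() < t else tok for tok in tokens]
    return t, masked
```

Passing an explicit `random.Random` instance keeps the sketch reproducible without touching global state.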
The on-policy nature is explicitly stated in p_26: 'x(p)_t is sampled using the current model parameters at each iteration, making this a strictly on-policy rollout process.' This is a clear specification of the training procedure.
The two forward passes are specified in p_27 with clear description of using masked noisy sequence and predicted noisy sequence as inputs respectively.
The loss function is specified in p_28-p_29 with cross-entropy over all token positions regardless of masking status. The final training objective (Eq. 6) combines both losses.
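The training step described in the notes above (an on-policy rollout, two forward passes, and cross-entropy over every position) can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's implementation: `model` is a hypothetical callable that maps a token sequence to one probability distribution per position, only masked positions are replaced by sampled tokens, and the two losses are summed with equal weight.

```python
import math
import random


def cross_entropy_all_positions(probs, targets):
    """Mean negative log-likelihood over all positions, masked or not."""
    return -sum(math.log(p[y]) for p, y in zip(probs, targets)) / len(targets)


def oput_loss(model, clean, masked, mask_id, rng):
    """Toy dual-supervision loss on the masked and the predicted sequence."""
    # Forward pass 1: the masked noisy sequence is the input.
    probs_masked = model(masked)
    # On-policy rollout: sample tokens from the current model's own
    # predictive distribution (assumption: only at masked positions).
    predicted = [
        rng.choices(range(len(p)), weights=p)[0] if tok == mask_id else tok
        for tok, p in zip(masked, probs_masked)
    ]
    # Forward pass 2: the predicted noisy sequence is the input.
    probs_predicted = model(predicted)
    # Assumption: the two cross-entropy terms are added with equal weight.
    return (cross_entropy_all_positions(probs_masked, clean)
            + cross_entropy_all_positions(probs_predicted, clean))
```

In a real framework the sampling step would be excluded from gradient computation; the sketch only shows the data flow.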
Specific accuracy numbers (78% to 90%) are provided in p_30 for GSM8K with threshold 0.5. The claim includes both accuracy improvement and faster decoding. Self-reported result without error bars.
Soft parallel decoding is fully described in p_33-p_50 with mathematical formulation of hybrid embeddings (Eqs. 7-10). The approach is validated through experiments and ablations showing its effectiveness.
The confidence threshold promotion strategy is clearly specified in p_34 with algorithmic description of scanning masked region left-to-right and promoting longest contiguous prefix exceeding threshold.
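The left-to-right scan described above is small enough to sketch directly. The function name and the choice of a strict comparison for "exceeding" are assumptions made for illustration; the paper's exact tie-breaking is not given in this analysis.

```python
def promote_prefix(confidences, start, threshold):
    """Return the new start index of the masked region.

    Scans from `start` to the right and promotes the longest contiguous
    run of positions whose confidence is strictly above `threshold`.
    """
    pos = start
    while pos < len(confidences) and confidences[pos] > threshold:
        pos += 1
    return pos
```

Because the scan stops at the first low-confidence position, a confident token further to the right is not promoted, which keeps the decoded region a contiguous prefix.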
The rationale for contiguous masked region is provided in p_34. Additionally, p_61 mentions 'Maintaining the non-masked region as a contiguous prefix further improves performance' in the ablation study, providing empirical support for this design cho
The hybrid embedding construction from top-1 prediction is specified in p_37 with mathematical notation.
The renormalization procedure is described in p_39-p_40 with rationale about avoiding norm collapse. The mathematical formulation is provided.
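To make the idea of confidence-weighted interpolation with renormalization concrete, here is a minimal sketch. The interpolation weight follows the description above; the specific renormalization target (a confidence-weighted mix of the two input norms) is an assumption chosen for illustration and may differ from the paper's Eqs. 7-10. All function names are hypothetical.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def hybrid_embedding(token_emb, mask_emb, confidence, eps=1e-12):
    """Blend the top-1 token embedding with the mask embedding.

    A plain convex combination of two vectors can have a much smaller
    norm than either input, so the result is rescaled afterwards.
    Assumption: the target norm is the same convex mix of the input norms.
    """
    mixed = [confidence * a + (1.0 - confidence) * b
             for a, b in zip(token_emb, mask_emb)]
    target = (confidence * l2_norm(token_emb)
              + (1.0 - confidence) * l2_norm(mask_emb))
    scale = target / max(l2_norm(mixed), eps)
    return [scale * x for x in mixed]
```

With orthogonal unit vectors and confidence 0.5, the unscaled mix has norm about 0.71; the rescaling restores it to 1.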
The convergence criteria are clearly specified in p_48 with two conditions: (1) unchanged predictions for two consecutive steps, or (2) all confidences exceeding τ_acc.
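The two stopping conditions can be written as a single predicate. This sketch reads "unchanged for two consecutive steps" as the current predictions matching those of the previous step, which is an interpretation; `has_converged` and its parameters are hypothetical names.

```python
def has_converged(prev_preds, curr_preds, confidences, tau_acc):
    """Stop when predictions are stable or every confidence exceeds tau_acc."""
    # Condition 1: predictions did not change since the previous step.
    unchanged = prev_preds is not None and prev_preds == curr_preds
    # Condition 2: every position is above the acceptance threshold.
    all_confident = all(c > tau_acc for c in confidences)
    return unchanged or all_confident
```

Passing `None` as `prev_preds` on the first step disables the first condition until there is a previous step to compare against.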
Specific accuracy numbers (68% to 90%) are provided in p_50 for GSM8K with τ_dec=0. This demonstrates SPD's effectiveness under highly aggressive parallel decoding. Self-reported without error bars.
The prerequisite relationship is explained in p_51 with rationale that OPUT trains consistent mapping from both mask and predicted embeddings. Empirical evidence in p_61 (Table 3) shows SPD causes collapse without OPUT.
... 35 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code available - neither official repository nor pseudocode for OPUT training method or SPD decoding strategy
- No training data available - exact subsets of Numina-Math, OpenThoughts, and OpenCodeInstruct not specified
- OPUT training method not explained - acronym undefined, implementation details missing
- SPD decoding strategy not described - algorithm details not provided
- Random seeds not specified for training or evaluation
- Exact data filtering criteria for self-distillation not specified (how subsets were selected)
- Detailed hyperparameters for baseline methods not fully specified
- LLaDA-2.0-mini base model access and exact version unclear
- dInFer evaluation framework configuration details missing
- Tensor parallelism configuration details not specified
Limitations (author-stated)
- we do not use any external high-quality responses, all supervision is obtained from the model's own generations
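This analysis was generated automatically by PDF Reading Assistant. It is for reference only and does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be biased. For the original paper, see arXiv.
Analysis time: 2026-04-26T07:24:12+00:00 · Data source: Paper Collector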