The paper identifies Mean Mode Screaming (MMS), a failure mode that drives ultra-deep Diffusion Transformers into a mean-dominated collapse state. The proposed MV-Split Residuals decouple the mean and centered gradient updates and train 400- to 1000-layer models stably without a convergence cost.
Core Question
What causes mean-dominated collapse states in ultra-deep Diffusion Transformers, and how can this failure mode be prevented without sacrificing convergence speed?
Core Method
Approach: The authors use a stripped-down single-stream DiT with Post-Norm residual chains and zero-initialized residual writers, trained with Rectified Flow on ImageNet-2012 latents. They analyze the geometric asymmetry in token-space decomposition and propose MV-Split Residuals, which apply orthogonal projectors to decouple mean-coherent and centered gradient updates with independent learnable gains (α, β).
Key components:
- MV-Split decouples the rank-one mean-coherent gradient update from the centered update using orthogonal projectors J and P.
- The method applies separate learnable gains (α, β) to mean and centered subspaces at the residual merge.
- Forward dynamics: centered subspace follows standard residual update with gain β, mean subspace becomes a per-feature leaky integrator with gain α.
- Backward dynamics: centered and mean-coherent gradients receive independent gains, damping mean-coherent updates without tying centered branch-gradient to small mean gain.
- Unlike LayerScale and ReZero, MV-Split distinguishes between mean and centered subspaces rather than suppressing both jointly.
- Multiple interventions targeting related objects failed to prevent MMS because they do not combine local mean/centered branch-gradient control with forward leaky mean replacement.
- Hard centering removes useful global information including image-level context and implicit timestep signal.
- Attention-only controls leave the FFN branch uncontrolled, allowing the spike to relocate to the ungated FFN branch.
- Gradient clipping cannot rotate mean-coherent writer updates back into the centered subspace, leaving the feature-learning path starved.
- Muon optimizer reshapes singular values in parameter space but does not implement the token-space split or forward leaky mean replacement used by MV-Split.
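The leaky-integrator reading of the mean path can be made concrete with a short worked recursion. This is a sketch under simplifying assumptions that are not in the summary: the RMSNorm in the merge is ignored, a single feature is tracked, and α is held constant across blocks. The symbols m^l (that feature's token mean of X^l) and f^l (the token mean of F^l) are shorthand introduced only for this illustration.

```latex
m^{l+1} = (1-\alpha)\, m^{l} + \alpha\, f^{l}
\quad\Longrightarrow\quad
m^{L} = (1-\alpha)^{L}\, m^{0} + \alpha \sum_{k=0}^{L-1} (1-\alpha)^{L-1-k}\, f^{k}
```

Under these assumptions, for 0 < α ≤ 1 the trunk mean is an exponentially weighted average of m^0 and the branch means, so its magnitude stays bounded by the larger of |m^0| and max_k |f^k| instead of growing with depth as a plain residual sum could.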
Claim Verification
The paper provides strong quantitative evidence for the mean-dominated collapse state. Figure 3 traces the divergence sequence with specific metrics (token cosine similarity, centered retention, Q/K gradients). Figure 7 and Appendix B show the collap
MMS is clearly defined as a terminology contribution. The paper provides empirical evidence of the abrupt entry event in Figure 3 and Figure 4, showing the gradient spike at step t⋆=3400, the mean-coherent gradient amplification, and subsequent Q/K g
This is a multi-part mechanistic contribution with mathematical proofs and empirical validation. Proposition 1 (row-stochastic attention preserves pure-mean states) is proven. The gradient decomposition (Eq. 5) is proven in Appendix C.1 with complete
MV-Split is fully specified in Eq. 7 and Section 5. The method combines a separately gained centered residual update (β ⊙ PF^l) with a leaky trunk-mean replacement ((1-α) ⊙ JX^l + α ⊙ JF^l). The forward and backward dynamics are derived in Eqs. 8-10.
Quantitative evidence is provided in Figure 5 and Table 1. For 400-layer runs, MV-Split achieves FID-50K of 38.9 at 50k steps vs LayerScale's 52.6. Table 1 shows MV-Split removes collapse events (no divergence) while un-stabilized baselines diverge.
This is a design choice with clear rationale. The paper explains that the stripped-down single-stream DiT is used so that deep residual propagation remains the dominant carrier. This is a methodological decision that doesn't require empirical validat
Design choice with clear rationale. The Post-Norm residual chain without AdaLN is specified to avoid introducing alternative depthwise control channels that would complicate attribution. This is a methodological decision appropriate for the mechanist
Design choice with clear specification. The concatenation approach for multimodal tokens is fully described. This is a methodological decision for the architecture.
Design choice with clear specification. The 2D RoPE extension for image tokens and no rotary encoding for text tokens is described with reference to prior work.
Design choice with clear specification and justification based on prior practice. The zero-initialization of residual writers is described with references to related work.
Appendix B (paragraphs 75-78) provides quantitative evidence. Figure 7 shows standard initialization entering the mean-dominated regime with high token similarity from the beginning, forming a depth-wise collapse front. The paper reports specific met
This is a mathematical proposition that follows directly from the definition of row-stochastic matrices. If A1 = 1 (row-stochastic), then Aμ(X) = A(JX) = JX = μ(X) because J = (1/T)11^T and A1 = 1. The proposition is stated in the paper and is mathem
Mathematical proposition with clear derivation. c(AX) = PAX = PAPX follows from the properties of projectors P and J. The bound ∥c(AX)∥ ≤ μ_eff(A)∥c(X)∥ follows from the definition of μ_eff as the spectral norm of PAP.
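The two propositions above can be checked in a few lines. The derivation below is a sketch that uses J = (1/T)11^T and A1 = 1 as stated, and additionally assumes P = I - J (so that PJ = 0 and P^2 = P) and that μ_eff(A) is the spectral norm of PAP, with the norm on token matrices taken as the Frobenius norm.

```latex
\begin{aligned}
AJ &= \tfrac{1}{T}\,(A\mathbf{1})\,\mathbf{1}^{\top} = \tfrac{1}{T}\,\mathbf{1}\mathbf{1}^{\top} = J
  &&\Rightarrow\; A\,\mu(X) = AJX = JX = \mu(X), \\
c(AX) &= PAX = PA(J+P)X = PJX + PAPX = PAPX
  &&\text{(using } AJ = J,\; PJ = 0\text{)}, \\
\lVert c(AX) \rVert &= \lVert (PAP)(PX) \rVert \le \lVert PAP \rVert_{2}\, \lVert PX \rVert
  = \mu_{\mathrm{eff}}(A)\, \lVert c(X) \rVert
  &&\text{(using } P^{2} = P\text{)}.
\end{aligned}
```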
Figure 3 traces the divergence sequence with specific measurements. The paper shows the gradient spike concentrated in the mean-coherent component while Q/K gradients collapse. The evidence is quantitative with specific metrics tracked over time.
The gradient decomposition is mathematically proven in Appendix C.1 (Eq. 5). The O(T) coherent regime is derived in Eq. 6 and Appendix C.2. Figure 4 empirically validates this with measurements showing A^-1 ≈ 167 corresponding to ~13× amplification.
Lemma 1 provides the mathematical basis for this mechanism. The proof in Appendix C.3 shows that when values homogenize (V_j = v for all j), the attention-logit gradient vanishes due to the Softmax Jacobian null space. The paper connects this to Q/K
Mathematical lemma with complete proof in Appendix C.3. The proof shows that when V_j = v for all j, ∂L/∂a_i ∝ 1, and since J_sm(a_i)1 = 0, the logit gradient vanishes.
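The null-space argument can be checked numerically with a minimal sketch. It assumes a single query row with output o = Σ_j softmax(a)_j V_j and an arbitrary upstream gradient; the function names `softmax` and `logit_grad` are hypothetical and not taken from the paper.

```python
import math

def softmax(a):
    """Numerically stable softmax over a list of logits."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def logit_grad(a, V, g_out):
    """dL/da for o = sum_j softmax(a)_j * V[j], given the upstream gradient g_out = dL/do."""
    p = softmax(a)
    # u_j = <g_out, V_j> is the gradient of L with respect to the attention weight p_j.
    u = [sum(g * v for g, v in zip(g_out, Vj)) for Vj in V]
    pu = sum(pj * uj for pj, uj in zip(p, u))
    # Softmax Jacobian applied to u: (diag(p) - p p^T) u = p * (u - <p, u>).
    return [pj * (uj - pu) for pj, uj in zip(p, u)]

a = [0.3, -1.2, 2.0]
g_out = [0.5, -1.0]

# Homogenized values: every V_j equals the same vector v, so u is a multiple of 1.
grad_homog = logit_grad(a, [[1.0, 2.0]] * 3, g_out)

# Distinct values: u is not constant, so the logit gradient survives.
grad_mixed = logit_grad(a, [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]], g_out)

print(max(abs(g) for g in grad_homog))  # numerically zero
print(max(abs(g) for g in grad_mixed))  # clearly nonzero
```

In the homogenized case u is a constant vector, which the softmax Jacobian maps to zero, so no signal reaches the logits regardless of the upstream gradient.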
The MV-Split merge equation is fully specified in Eq. 7. The formula Z^l = RMSNorm((1-α)⊙JX^l + α⊙JF^l + β⊙PF^l) is clearly presented with explanation of α, β as per-block learnable vectors.
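A minimal pure-Python sketch of this merge follows. It is not the authors' implementation. Assumptions: J is the token-mean projector (1/T)11^T, P = I - J, RMSNorm is applied per token without a learned scale, and all tokens form a single segment (no segment-wise projectors). The formula as quoted lists three terms; by default the sketch also carries the centered trunk PX^l, which is an assumption based on the description of the centered path as a standard residual update. Pass `carry_centered_trunk=False` for the literal three-term form. All function names are hypothetical.

```python
import math

def token_mean(X):
    """J X: every token row replaced by the per-feature mean over tokens."""
    T, D = len(X), len(X[0])
    mean = [sum(X[t][d] for t in range(T)) / T for d in range(D)]
    return [list(mean) for _ in range(T)]

def centered(X):
    """P X = X - J X (assumes P = I - J)."""
    return [[x - m for x, m in zip(row, mrow)]
            for row, mrow in zip(X, token_mean(X))]

def rms_norm(Z, eps=1e-6):
    """Per-token RMSNorm without a learned scale (a simplification)."""
    out = []
    for row in Z:
        rms = math.sqrt(sum(z * z for z in row) / len(row) + eps)
        out.append([z / rms for z in row])
    return out

def mv_split_premerge(X, F, alpha, beta, carry_centered_trunk=True):
    """(1-alpha)*JX + alpha*JF + beta*PF, plus PX when carry_centered_trunk is set.

    X, F: T x D lists (trunk state and branch output); alpha, beta: length-D gains.
    """
    JX, JF, PX, PF = token_mean(X), token_mean(F), centered(X), centered(F)
    M = []
    for t in range(len(X)):
        row = []
        for d in range(len(X[0])):
            v = (1 - alpha[d]) * JX[t][d] + alpha[d] * JF[t][d] + beta[d] * PF[t][d]
            if carry_centered_trunk:
                v += PX[t][d]  # assumed term, not shown in the quoted formula
            row.append(v)
        M.append(row)
    return M

def mv_split_merge(X, F, alpha, beta, carry_centered_trunk=True):
    """Apply RMSNorm to the pre-merge sum."""
    return rms_norm(mv_split_premerge(X, F, alpha, beta, carry_centered_trunk))

X = [[1.0, 2.0], [3.0, 4.0]]   # two tokens, two features
F = [[0.0, 2.0], [2.0, 0.0]]
alpha, beta = [0.5, 0.5], [1.0, 1.0]
M = mv_split_premerge(X, F, alpha, beta)
print(token_mean(M)[0])  # mean path: halfway between the trunk mean and the branch mean
print(centered(M))       # centered path: PX + beta * PF
```

Because J and P act on the token axis while α and β act on the feature axis, the mean of the pre-norm sum depends only on α and its centered part only on β, which is the decoupling the merge is designed to provide.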
Design choice with rationale explained in the main text and Appendix E. The segment-wise projectors avoid mixing image and text means in the residual control path.
The experimental setup is clearly specified in the paper. Table 3 provides architecture and training hyperparameters. The matched comparison conditions are described.
... 49 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available
- No data or pretrained models available
- Critical hyperparameters missing from main text (learning rate, batch size, optimizer settings, weight decay, total training steps)
- Appendix G referenced for 'detailed training configuration' but not accessible
- Random seeds not specified
- Hardware/environment specifications not provided (GPU type, memory, distributed training setup)
- Model architecture details missing (hidden dimensions, attention heads, MLP dimensions, total parameters)
- Initialization values for learnable vectors α and β not specified
- LayerScale λ_init sweep values for baseline comparison not provided
- Details of ~50k curated image set for 1000-layer post-training not specified
Limitations (as stated by the authors)
- The alignment-amplification law in Eq. 6 characterizes when token-wise writer gradients stop canceling and enter a coherent accumulation regime. This provides a mechanistic diagnostic for the MMS transition, but it does not by itself predict the exact training step t^⋆ at which an un-stabilized run will cross the critical regime before the run is observed.
- Several parts of our analysis use Transformer-specific structure. In particular, row-stochastic attention preserves pure-mean token states (Proposition 1), and value homogenization suppresses Q/K logit gradients through the null space of the Softmax Jacobian (Lemma 1). These arguments do not directly transfer to attention-free sequence mixers such as convolutional diffusers or state-space models such as Mamba.
- Our scale validation focuses on image and text-to-image diffusion. Video, 3D, and other spatiotemporal generators often operate with substantially longer token sequences and additional structure across time, views, or modalities.
- These numbers are not intended as a controlled comparison to large public text-to-image systems: our model is trained on substantially smaller and differently sourced data (ImageNet-2012 pretraining followed by SFT and DPO on ~50k curated images), uses a shorter training schedule, and uses a simpler post-training pipeline.
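This analysis was generated automatically by PDF 阅读助手 (PDF Reading Assistant). It is provided for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment come from automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-05-12T01:16:57+00:00 · Data source: Paper Collector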