NUMINA improves numerical alignment in text-to-video diffusion models through a training-free approach that detects count discrepancies via DiT attention analysis and corrects them through layout refinement. On CountBench, it achieves 7.4% higher accuracy on a 1.
Core Question
How can text-to-video diffusion models be improved to accurately align the number of generated objects with numerical tokens in text prompts?
Core Method
Approach: NUMINA employs a two-phase training-free pipeline. First, it identifies numerical misalignment by selecting instance-discriminative attention heads and constructing countable layouts through clustering. Second, it guides regeneration via layout refinement (adding or removing instances) and attention modulation (suppression for removal, boosting for addition) to enforce correct object counts.
Key components:
- Video editing methods focus on motion control, style transfer, and appearance editing but overlook instance-level addition.
- VideoGrain supports multi-region editing through attention modulation.
- Methods like OmnimatteZero and DiffuEraser handle object removal via video inpainting.
- Existing approaches fail to align textual numerals with visual content and typically require video-to-video settings with segmentation masks.
- NUMINA employs a two-phase pipeline with an identify-then-guide paradigm.
- The first phase performs pre-generation to establish scene layout localization.
- The second phase re-generates video through modified layout guidance.
- The framework transforms implicit attention into explicit layout signals for count-accurate generation.
Source sections: sec_4, sec_7
Claim Verification
The paper provides quantitative evidence in Table 1 showing NUMINA improves counting accuracy (CountAcc) across all model sizes while maintaining or improving CLIP scores and Temporal Consistency. The training-free nature is clearly demonstrated thro
The two-phase identify-then-guide paradigm is fully described in Sections 4.1 and 4.2. Quantitative results in Table 1 demonstrate improved counting accuracy while Table 1 also shows maintained or improved CLIP scores and Temporal Consistency, suppor
The benchmark is described in paragraph 46 with specific details: 210 prompts, counts 1-8, 1-3 object categories. The creation process using GPT-5 followed by manual review is documented.
The paper demonstrates through Figures 2 and 4 that attention patterns reveal instance-level information. The methodology in Section 4.1 shows how attention heads can be selected to identify instances, and Table 2 validates that attention-derived lay
While Figure 2 is cited as visual evidence showing diffuse cross-attention for numerical tokens, the claim lacks quantitative measurement of 'diffuseness'. For a finding claim, visual observation alone without quantitative metrics (e.g., entropy, spr
This claim about downsampled latent space limiting separability is stated as an assertion without empirical demonstration. The paper does not conduct experiments to prove this specific limitation - no ablation on latent space resolution or comparison
The numbers can be verified against Table 1: Wan2.1-1.3B improves from 42.3% to 49.7% (a gain of 7.4 percentage points), and the 14B model improves from 53.6% to 59.1% (a gain of 5.5 percentage points). The arithmetic is correct.
Table 1 shows consistent CLIP score increases across all model sizes: 1.3B (33.9→35.6), 5B (34.6→35.3), 14B (35.2→35.6). The interpretation about 'cleaner scene layouts' is plausible though not directly measured.
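The gains quoted from Table 1 can be rechecked with a few lines of arithmetic. The sketch below uses only the figures cited in the two notes above. The variable and function names are illustrative, and the differences are absolute (percentage points for CountAcc), not relative improvements.

```python
# Figures as quoted in the verification notes (Table 1 of the paper).
count_acc = {"Wan2.1-1.3B": (42.3, 49.7), "14B": (53.6, 59.1)}
clip_score = {"1.3B": (33.9, 35.6), "5B": (34.6, 35.3), "14B": (35.2, 35.6)}


def absolute_gains(table):
    """Return (after - before) for each model, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in table.items()}


count_gains = absolute_gains(count_acc)
clip_gains = absolute_gains(clip_score)

print("CountAcc gains (percentage points):", count_gains)
print("CLIP score gains:", clip_gains)
```

Every CLIP difference is positive, which matches the note's statement that the score increases for all three model sizes.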
The methodology is fully specified in Section 4.1 with concrete parameters: timestep t*=20 and layer ℓ*=15 for early denoising attention extraction.
The head selection process with discriminability scoring is fully specified in paragraphs 20-21, and the clustering algorithm for segmentation is described in paragraph 26 with references to mean shift and density-based clustering.
The two-phase pipeline is clearly described in Section 4 with Phase 1 (identification) in Section 4.1 and Phase 2 (guidance) in Section 4.2.
The selection criteria for self-attention (discriminability score) and cross-attention (peak activation) are fully specified in paragraphs 20-23, with the fusion process described in paragraphs 25-29.
While Figure 4(a) is cited as visual evidence, this finding lacks quantitative measurement of 'head-wise diversity', 'spatial focus', 'category selectivity', or 'instance separability'. For a finding claim, visual observation alone without quantitati
The design choice is fully specified with concrete parameters: t*=20 and ℓ*=15, as stated in paragraph 49.
All three scores are fully specified in paragraph 20 with clear definitions: S1 (standard deviation of intensities), S2 (variance across blocks), S3 (Sobel gradient magnitude).
The formula is explicitly stated in paragraph 20-21 with the weighted sum structure and the role of γ clearly explained.
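To make the three scores concrete, here is a minimal NumPy sketch of a head-discriminability score built from them. It is an illustration under stated assumptions, not the paper's formula: the notes above only say that the score is a weighted sum involving γ, so the combination `S1 + S2 + gamma * S3`, the 4x4 block grid, and all function names are hypothetical choices.

```python
import numpy as np


def sobel_magnitude(a):
    """S3 (assumed form): mean Sobel gradient magnitude of a 2-D map."""
    p = np.pad(a, 1, mode="edge")
    # Horizontal gradient: right column minus left column of each 3x3 window.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical gradient: bottom row minus top row of each 3x3 window.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return float(np.mean(np.hypot(gx, gy)))


def block_variance(a, blocks=4):
    """S2 (assumed form): variance of the block means over a blocks x blocks grid."""
    h, w = a.shape
    bh, bw = h // blocks, w // blocks
    trimmed = a[:bh * blocks, :bw * blocks]
    means = trimmed.reshape(blocks, bh, blocks, bw).mean(axis=(1, 3))
    return float(means.var())


def discriminability(a, gamma=1.0, blocks=4):
    """Hypothetical weighted sum; the paper's exact weighting may differ."""
    s1 = float(a.std())  # S1: standard deviation of intensities
    return s1 + block_variance(a, blocks) + gamma * sobel_magnitude(a)


def select_head(maps, gamma=1.0):
    """Return the index of the highest-scoring map and all scores."""
    scores = [discriminability(m, gamma) for m in maps]
    return int(np.argmax(scores)), scores


# Toy input: a flat map versus a map with two separated blobs.
flat = np.full((16, 16), 0.5)
blobs = np.zeros((16, 16))
blobs[2:6, 2:6] = 1.0
blobs[10:14, 9:13] = 1.0
best, scores = select_head([flat, blobs])
print(best, scores)
```

The flat map scores zero on all three terms, so the map with distinct blobs is selected, which is the behaviour one would expect from a score meant to favour instance-separating heads.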
The selection criterion for cross-attention heads is fully specified in paragraph 22 with the argmax formula based on peak activation.
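The argmax-over-peak-activation rule can be sketched as follows. This assumes the cross-attention for a single noun token is available as an array of shape `(num_heads, H, W)`. The array layout and the function name are assumptions made for illustration.

```python
import numpy as np


def select_cross_attention_head(cross_maps):
    """Pick the head whose map has the largest peak activation.

    cross_maps: array of shape (num_heads, H, W) for one text token (assumed layout).
    """
    peaks = cross_maps.reshape(cross_maps.shape[0], -1).max(axis=1)
    return int(np.argmax(peaks)), peaks


# Toy input: three heads, the second has the sharpest peak.
maps = np.zeros((3, 4, 4))
maps[0, 1, 1] = 0.3
maps[1, 2, 3] = 0.9
maps[2, 0, 0] = 0.6
head, peaks = select_cross_attention_head(maps)
print(head, peaks)
```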
The clustering-based partitioning is described in paragraph 26 with reference to mean shift clustering.
The 0.1 peak-ratio threshold is explicitly specified in paragraph 26 for suppressing low values in cross-attention maps.
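A peak-ratio threshold of this kind is straightforward to express in NumPy. The sketch below zeroes every value below 0.1 times the map's maximum. Whether the comparison is strict or inclusive is not stated in the notes above, so the `>=` used here is an assumption, as is the function name.

```python
import numpy as np


def suppress_low_values(attn_map, peak_ratio=0.1):
    """Zero out entries below peak_ratio * max(attn_map)."""
    cutoff = peak_ratio * attn_map.max()
    return np.where(attn_map >= cutoff, attn_map, 0.0)


attn = np.array([[0.05, 0.50],
                 [1.00, 0.09]])
cleaned = suppress_low_values(attn)
print(cleaned)
```

With a peak of 1.0 the cutoff is 0.1, so the 0.05 and 0.09 entries are removed while 0.5 and 1.0 are kept.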
The threshold τ for semantic overlap score is described in paragraph 27-28 for retaining valid instances.
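A filter of this shape might look like the sketch below. The notes above do not define the semantic overlap score or give a value for τ, so the definition used here (fraction of a candidate region covered by a cross-attention foreground mask), the default `tau=0.5`, and the function name are all hypothetical.

```python
import numpy as np


def filter_instances(labels, foreground, tau=0.5):
    """Keep candidate regions whose overlap with the foreground mask is >= tau.

    labels: integer map, 0 = background, k > 0 = candidate region k (assumed encoding).
    foreground: boolean mask derived from cross-attention (assumed input).
    """
    kept = []
    for k in np.unique(labels):
        if k == 0:
            continue
        region = labels == k
        overlap = float((region & foreground).sum() / region.sum())
        if overlap >= tau:
            kept.append(int(k))
    return kept


labels = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 2, 2],
                   [0, 0, 2, 2]])
foreground = np.zeros((4, 4), dtype=bool)
foreground[:2, :2] = True   # fully covers region 1
foreground[2, 2] = True     # covers one of four pixels in region 2
kept = filter_instances(labels, foreground)
print(kept, "-> estimated count:", len(kept))
```

Under these assumptions the number of retained regions serves as the estimated instance count that is compared against the numeral in the prompt.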
... 53 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code repository available - core algorithm implementation details for attention extraction and layout guidance modification are not provided
- CountBench benchmark with 210 prompts is not publicly available
- Random seeds for noise vector sampling not specified
- Exact attention extraction methodology and normalization procedures not detailed
- Threshold values for numerical misalignment identification not provided
- GroundingDINO model version, configuration, and confidence thresholds for object detection not specified
- CLIP model version used for evaluation not specified
- Exact text prompts used for GroundingDINO category-specific detection not provided
- Layer indexing convention for Wan T2V model not clarified (which layer corresponds to ℓ*=15)
- Details on how scene layout is extracted from attention maps not fully specified
Limitations (as stated by the authors)
- While NUMINA significantly improves numerical alignment, achieving perfect accuracy across all scenarios remains challenging.
- Generating very dense instances (e.g., tens or hundreds) remains unexplored.
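This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-19T07:13:26+00:00 · Data source: Paper Collector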