TL;DR
NUMINA improves numerical alignment in text-to-video diffusion models through a training-free approach that detects count discrepancies via DiT attention analysis and corrects them through layout refinement. On CountBench, it achieves 7.4% higher accuracy on a 1.
Verified: 47 · Insufficient evidence: 4 · Unverifiable: 2 · Reproducibility: N/A · Confidence: 86%

Core Question

How can text-to-video diffusion models be improved to accurately align the number of generated objects with numerical tokens in text prompts?

Core Method

Approach: NUMINA employs a two-phase training-free pipeline. First, it identifies numerical misalignment by selecting instance-discriminative attention heads and constructing countable layouts through clustering. Second, it guides regeneration via layout refinement (adding or removing instances) and attention modulation (suppression for removal, boosting for addition) to enforce correct object counts.

Key components:
- Video editing methods focus on motion control, style transfer, and appearance editing but overlook instance-level addition.
- VideoGrain supports multi-region editing through attention modulation.
- Methods like OmnimatteZero and DiffuEraser handle object removal via video inpainting.
- Existing approaches fail to align textual numerals with visual content and typically require video-to-video settings with segmentation masks.
- NUMINA employs a two-phase pipeline with an identify-then-guide paradigm.
- The first phase performs pre-generation to establish scene layout localization.
- The second phase re-generates video through modified layout guidance.
- The framework transforms implicit attention into explicit layout signals for count-accurate generation.

Source sections: sec_4, sec_7
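The guidance phase described above (add or remove instances, then boost or suppress attention) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the `boost` and `suppress` factors, and the use of simple multiplicative scaling are all assumptions made for this example.

```python
def plan_layout_refinement(detected_count, target_count):
    """Decide whether instances must be added or removed, and how many."""
    diff = target_count - detected_count
    if diff > 0:
        return ("add", diff)
    if diff < 0:
        return ("remove", -diff)
    return ("keep", 0)


def modulate_attention(attn, region_mask, action, boost=1.5, suppress=0.5):
    """Scale attention inside region_mask (hypothetical factors).

    'add' multiplies by `boost`, 'remove' multiplies by `suppress`,
    anything else leaves the map unchanged.
    """
    factor = {"add": boost, "remove": suppress}.get(action, 1.0)
    return [
        [v * factor if m else v for v, m in zip(a_row, m_row)]
        for a_row, m_row in zip(attn, region_mask)
    ]
```

For a prompt asking for three objects where the pre-generation layout shows five, `plan_layout_refinement(5, 3)` yields a removal of two instances, and the corresponding regions would then be passed to `modulate_attention` with `action="remove"`.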

Claim Verification

Verified (90%) We propose NUMINA, a training-free video generation framework that enhances numerical alignment in T2V generation while preserving visual fidelity and temporal coherence.
The paper provides quantitative evidence in Table 1 showing NUMINA improves counting accuracy (CountAcc) across all model sizes while maintaining or improving CLIP scores and Temporal Consistency. The training-free nature is clearly demonstrated thro
Verified (85%) NUMINA introduces an identify-then-guide paradigm, which yields accurate cardinalities and retains appearance, motion, and semantics.
The two-phase identify-then-guide paradigm is fully described in Sections 4.1 and 4.2. Quantitative results in Table 1 demonstrate improved counting accuracy while Table 1 also shows maintained or improved CLIP scores and Temporal Consistency, suppor
Verified (85%) We introduce the CountBench benchmark, comprising 210 prompts covering counts from 1-8 for scenes involving 1-3 object categories.
The benchmark is described in paragraph 46 with specific details: 210 prompts, counts 1-8, 1-3 object categories. The creation process using GPT-5 followed by manual review is documented.
Verified (80%) We reveal that the attentions in T2V models expose critical visual information related to the number of instances.
The paper demonstrates through Figures 2 and 4 that attention patterns reveal instance-level information. The methodology in Section 4.1 shows how attention heads can be selected to identify instances, and Table 2 validates that attention-derived lay
Insufficient evidence (60%) Numerical tokens exhibit diffuse cross-attention responses compared to other word types.
While Figure 2 is cited as visual evidence showing diffuse cross-attention for numerical tokens, the claim lacks quantitative measurement of 'diffuseness'. For a finding claim, visual observation alone without quantitative metrics (e.g., entropy, spr
Insufficient evidence (50%) The heavily downsampled spatiotemporal latent space in DiT-based architectures limits the separability of individual object representations, making stable count control difficult.
This claim about downsampled latent space limiting separability is stated as an assertion without empirical demonstration. The paper does not conduct experiments to prove this specific limitation - no ablation on latent space resolution or comparison
Verified (95%) On CountBench, NUMINA improves by 7.4% counting accuracy on Wan2.1-1.3B and by 5.5% on a larger 14B model.
The numbers can be checked directly against Table 1: Wan2.1-1.3B improves from 42.3% to 49.7% (a gain of 7.4 percentage points), and the 14B model improves from 53.6% to 59.1% (a gain of 5.5 percentage points). The arithmetic is correct.
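The arithmetic above can be re-checked with a few lines. The CountAcc values are the ones quoted from Table 1; the helper names are made up for this sketch. Note that the reported gains are absolute differences in percentage points, which is not the same as a relative improvement over the baseline.

```python
# CountAcc values (percent) as quoted above from Table 1: (baseline, with NUMINA).
results = {
    "Wan2.1-1.3B": (42.3, 49.7),
    "14B": (53.6, 59.1),
}


def gain_points(baseline, improved):
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(improved - baseline, 1)


def relative_gain(baseline, improved):
    """Relative gain over the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (improved - baseline) / baseline, 1)


gains = {name: gain_points(b, n) for name, (b, n) in results.items()}
```

Here `gains` maps each model to its percentage-point improvement, which should match the 7.4 and 5.5 figures stated in the claim.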
Verified (85%) We observe a consistent increase in CLIP score for various baselines, suggesting that enforcing correct instance counts strengthens overall text-video alignment and yields cleaner scene layouts.
Table 1 shows consistent CLIP score increases across all model sizes: 1.3B (33.9→35.6), 5B (34.6→35.3), 14B (35.2→35.6). The interpretation about 'cleaner scene layouts' is plausible though not directly measured.
Verified (90%) In the first phase, NUMINA operates early during denoising to detect misalignment between numeral tokens and the evolving latent layout.
The methodology is fully specified in Section 4.1 with concrete parameters: timestep t*=20 and layer ℓ*=15 for early denoising attention extraction.
Verified (90%) NUMINA performs a dynamic selection of attention heads using an object discriminability criterion, then applies a cluster-based algorithm to obtain precise segmentation.
The head selection process with discriminability scoring is fully specified in paragraphs 20-21, and the clustering algorithm for segmentation is described in paragraph 26 with references to mean shift and density-based clustering.
Verified (90%) We utilize a two-phase pipeline for the training-free framework, following an identify-then-guide paradigm.
The two-phase pipeline is clearly described in Section 4 with Phase 1 (identification) in Section 4.1 and Phase 2 (guidance) in Section 4.2.
Verified (90%) We select the most instance-discriminative self-attention head and the most text-concentrated cross-attention head, and then fuse their maps to obtain an instance-level layout that is explicitly countable.
The selection criteria for self-attention (discriminability score) and cross-attention (peak activation) are fully specified in paragraphs 20-23, with the fusion process described in paragraphs 25-29.
Insufficient evidence (55%) We observe substantial head-wise diversity in spatial focus, category selectivity, and instance separability.
While Figure 4(a) is cited as visual evidence, this finding lacks quantitative measurement of 'head-wise diversity', 'spatial focus', 'category selectivity', or 'instance separability'. For a finding claim, visual observation alone without quantitati
Verified (90%) At a reference timestep t⋆ during the pre-generation trajectory, we select attention heads from an intermediate layer ℓ⋆.
The design choice is fully specified with concrete parameters: t*=20 and ℓ*=15, as stated in paragraph 49.
Verified (90%) We design three complementary scores to measure the separability: 1) Foreground-background separation S1, 2) Structural richness S2, 3) Edge clarity S3.
All three scores are fully specified in paragraph 20 with clear definitions: S1 (standard deviation of intensities), S2 (variance across blocks), S3 (Sobel gradient magnitude).
Verified (90%) The overall discriminability score for head h is a weighted sum formed as: S(SA_h) = S1_h + S2_h + γ·S3_h, where γ > 0 balances the contribution of edge clarity against the global contrast and intermediate-scale structure.
The formula is explicitly stated in paragraphs 20-21, with the weighted-sum structure and the role of γ clearly explained.
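The discriminability score S(SA_h) = S1_h + S2_h + γ·S3_h could be sketched as below. Several details are assumptions for illustration only: that each head's self-attention has already been reduced to a 2D spatial map (a list of rows), that S2 is the variance of block means with a hypothetical `block` size, that S3 is the mean Sobel gradient magnitude over interior pixels, and the value `gamma=0.5`. The summary only states that S1 is a standard deviation of intensities, S2 a variance across blocks, and S3 a Sobel gradient magnitude.

```python
import math
from statistics import mean, pstdev, pvariance


def s1_contrast(a):
    """S1: standard deviation of all intensities in the map."""
    return pstdev([v for row in a for v in row])


def s2_structure(a, block=2):
    """S2: variance of non-overlapping block means (block size is assumed)."""
    h, w = len(a), len(a[0])
    means = [
        mean(a[y][x] for y in range(i, i + block) for x in range(j, j + block))
        for i in range(0, h - block + 1, block)
        for j in range(0, w - block + 1, block)
    ]
    return pvariance(means)


def s3_edges(a):
    """S3: mean Sobel gradient magnitude over interior pixels (needs >= 3x3)."""
    h, w = len(a), len(a[0])
    mags = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (a[y - 1][x + 1] + 2 * a[y][x + 1] + a[y + 1][x + 1]) - (
                a[y - 1][x - 1] + 2 * a[y][x - 1] + a[y + 1][x - 1]
            )
            gy = (a[y + 1][x - 1] + 2 * a[y + 1][x] + a[y + 1][x + 1]) - (
                a[y - 1][x - 1] + 2 * a[y - 1][x] + a[y - 1][x + 1]
            )
            mags.append(math.hypot(gx, gy))
    return mean(mags)


def discriminability(a, gamma=0.5):
    """S = S1 + S2 + gamma * S3 (gamma value is a placeholder)."""
    return s1_contrast(a) + s2_structure(a) + gamma * s3_edges(a)


def select_head(maps, gamma=0.5):
    """Index of the head whose map has the highest discriminability score."""
    return max(range(len(maps)), key=lambda h: discriminability(maps[h], gamma))
```

A completely flat map scores zero on all three terms, while a map with a clear blob scores higher, so `select_head` prefers the head that separates an object from its background.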
Verified (90%) For each target noun token T in the prompt, we select its best cross-attention head h*_c(T) = argmax_h C^h_T based on peak activation.
The selection criterion for cross-attention heads is fully specified in paragraph 22 with the argmax formula based on peak activation.
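The argmax over peak activations can be illustrated with a short sketch. It assumes that C^h_T is simply the maximum value of head h's cross-attention map for token T, and that each map is given as a list of rows; both the data layout and the function names are assumptions for this example.

```python
def peak_activation(cross_map):
    """Peak value of one head's cross-attention map for a given token."""
    return max(max(row) for row in cross_map)


def select_cross_head(maps_for_token):
    """Return the index h that maximises the peak activation (argmax_h C^h_T)."""
    peaks = [peak_activation(m) for m in maps_for_token]
    return max(range(len(peaks)), key=peaks.__getitem__)
```

Given one map per head for the same noun token, `select_cross_head` returns the index of the head with the sharpest response; ties resolve to the lowest index.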
Verified (90%) Spatial proposals are generated by partitioning the self-attention map A_s into contiguous regions using clustering.
The clustering-based partitioning is described in paragraph 26 with reference to mean shift clustering.
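To show what "partitioning into contiguous regions" means in practice, here is a deliberately simplified stand-in. It does not implement mean shift, which is what the summary says the paper references; instead it thresholds the map and labels 4-connected components with a breadth-first flood fill. The `threshold` value and the function name are assumptions for this sketch.

```python
from collections import deque


def contiguous_regions(attn, threshold=0.5):
    """Label 4-connected regions where attn >= threshold.

    Returns (labels, n): labels[y][x] is 0 for background or a region id
    in 1..n. This is a stand-in for the clustering step, not mean shift.
    """
    h, w = len(attn), len(attn[0])
    labels = [[0] * w for _ in range(h)]
    n = 0
    for y in range(h):
        for x in range(w):
            if attn[y][x] < threshold or labels[y][x]:
                continue
            n += 1
            labels[y][x] = n
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (
                        0 <= ny < h
                        and 0 <= nx < w
                        and not labels[ny][nx]
                        and attn[ny][nx] >= threshold
                    ):
                        labels[ny][nx] = n
                        queue.append((ny, nx))
    return labels, n
```

Each labelled region is one spatial proposal. A real mean-shift or density-based clustering would group pixels by feature similarity rather than by a fixed threshold, so it can separate touching instances that this stand-in would merge.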
Verified (90%) A_c,T is processed by suppressing values below a 0.1 peak-ratio threshold to isolate peak responses.
The 0.1 peak-ratio threshold is explicitly specified in paragraph 26 for suppressing low values in cross-attention maps.
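One plausible reading of the 0.1 peak-ratio threshold is that every value below 0.1 times the map's maximum is set to zero. The sketch below implements that reading; the interpretation, the function name, and the list-of-rows layout are assumptions rather than details confirmed by the summary.

```python
def suppress_below_peak_ratio(cross_map, ratio=0.1):
    """Zero out values below ratio * peak, keeping only the strong responses."""
    peak = max(max(row) for row in cross_map)
    cutoff = ratio * peak
    return [[v if v >= cutoff else 0.0 for v in row] for row in cross_map]
```

With a peak of 1.0, a value of 0.05 is removed while 0.2 survives, which leaves only the regions where the noun token responds strongly.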
Verified (90%) A region is retained as a valid instance if the semantic overlap score S_o ≥ τ.
The threshold τ for the semantic overlap score is described in paragraphs 27-28 for retaining valid instances.
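The retention rule S_o ≥ τ can be sketched as follows. The summary does not define S_o or give a value for τ, so this example assumes a hypothetical definition (the fraction of a region's pixels with nonzero thresholded cross-attention) and a placeholder `tau=0.5`. Only the comparison against τ comes from the text above.

```python
def overlap_score(region_mask, cross_map):
    """Assumed S_o: fraction of the region's pixels with nonzero cross-attention."""
    inside = [
        cross_map[y][x]
        for y, row in enumerate(region_mask)
        for x, m in enumerate(row)
        if m
    ]
    if not inside:
        return 0.0
    return sum(1 for v in inside if v > 0) / len(inside)


def retain_instances(region_masks, cross_map, tau=0.5):
    """Indices of regions whose assumed overlap score reaches tau."""
    return [
        i for i, mask in enumerate(region_masks)
        if overlap_score(mask, cross_map) >= tau
    ]
```

The number of retained regions is the detected instance count, which can then be compared with the numeral in the prompt.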

... 53 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing reproduction details

Limitations (as stated by the authors)

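This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手). It is for reference only and does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.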

Analysis time: 2026-04-19T07:13:26+00:00 · Data source: Paper Collector