When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models - AI In-Depth Paper Analysis

TL;DR
NUMINA improves numerical alignment in text-to-video diffusion models through a training-free approach that detects count discrepancies via DiT attention analysis and corrects them through layout refinement. On CountBench, it achieves 7.4% higher accuracy on a 1.

Verified

Insufficient evidence

Unverifiable

N/A

Reproducibility

Confidence

86%

Core Question

How can text-to-video diffusion models be improved to accurately align the number of generated objects with numerical tokens in text prompts?

Core Method

Approach: NUMINA employs a two-phase training-free pipeline. First, it identifies numerical misalignment by selecting instance-discriminative attention heads and constructing countable layouts through clustering. Second, it guides regeneration via layout refinement (adding or removing instances) and attention modulation (suppression for removal, boosting for addition) to enforce correct object counts.

Key components:
Video editing methods focus on motion control, style transfer, and appearance editing but overlook instance-level addition.
VideoGrain supports multi-region editing through attention modulation.
Methods like OmnimatteZero and DiffuEraser handle object removal via video inpainting.
Existing approaches fail to align textual numerals with visual content and typically require video-to-video settings with segmentation masks.
NUMINA employs a two-phase pipeline with an identify-then-guide paradigm.
The first phase performs pre-generation to establish scene layout localization.
The second phase re-generates video through modified layout guidance.
The framework transforms implicit attention into explicit layout signals for count-accurate generation.

Section IDs: sec_4, sec_7
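
The guidance decision described above (add or remove instances in the layout, then boost or suppress attention accordingly) can be sketched as a small planning step. This is only an illustration under stated assumptions: the function name `plan_count_correction`, the dictionary keys, and the idea that the decision depends solely on the signed count difference are hypothetical and are not taken from the paper.

```python
def plan_count_correction(target_count: int, detected_count: int) -> dict:
    """Map a count discrepancy to a layout edit and an attention modulation.

    Hypothetical helper: the pairing (add -> boost, remove -> suppress)
    follows the summary above; everything else is an assumption.
    """
    diff = target_count - detected_count
    if diff > 0:
        # Too few instances in the pre-generated layout: add and boost.
        return {"layout_op": "add", "num_instances": diff, "attention_op": "boost"}
    if diff < 0:
        # Too many instances: remove and suppress.
        return {"layout_op": "remove", "num_instances": -diff, "attention_op": "suppress"}
    # Counts already agree, so no correction is planned.
    return {"layout_op": "keep", "num_instances": 0, "attention_op": "none"}


if __name__ == "__main__":
    # Prompt asks for three objects, layout analysis found five.
    print(plan_count_correction(target_count=3, detected_count=5))
```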

Claim Verification

Verified (90%) We propose NUMINA, a training-free video generation framework that enhances numerical alignment in T2V generation while preserving visual fidelity and temporal coherence.
The paper provides quantitative evidence in Table 1 showing NUMINA improves counting accuracy (CountAcc) across all model sizes while maintaining or improving CLIP scores and Temporal Consistency. The training-free nature is clearly demonstrated thro...

Verified (85%) NUMINA introduces an identify-then-guide paradigm, which yields accurate cardinalities and retains appearance, motion, and semantics.
The two-phase identify-then-guide paradigm is fully described in Sections 4.1 and 4.2. Quantitative results in Table 1 demonstrate improved counting accuracy alongside maintained or improved CLIP scores and Temporal Consistency, suppor...

Verified (85%) We introduce the CountBench benchmark, comprising 210 prompts covering counts from 1-8 for scenes involving 1-3 object categories.
The benchmark is described in paragraph 46 with specific details: 210 prompts, counts 1-8, 1-3 object categories. The creation process using GPT-5 followed by manual review is documented.

Verified (80%) We reveal that the attentions in T2V models expose critical visual information related to the number of instances.
The paper demonstrates through Figures 2 and 4 that attention patterns reveal instance-level information. The methodology in Section 4.1 shows how attention heads can be selected to identify instances, and Table 2 validates that attention-derived lay...

Insufficient evidence (60%) Numerical tokens exhibit diffuse cross-attention responses compared to other word types.
While Figure 2 is cited as visual evidence showing diffuse cross-attention for numerical tokens, the claim lacks quantitative measurement of 'diffuseness'. For a finding claim, visual observation alone without quantitative metrics (e.g., entropy, spr...
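
One way the 'diffuseness' mentioned above could be quantified is with the Shannon entropy of an attention map treated as a probability distribution. The sketch below is not from the paper; the function names are hypothetical, and it assumes the map is a non-negative 2D grid given as nested lists. A normalized value near 1 indicates a diffuse response, a value near 0 a sharply peaked one.

```python
import math


def attention_entropy(attn_map):
    """Shannon entropy (in nats) of a non-negative 2D attention map."""
    flat = [v for row in attn_map for v in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    probs = [v / total for v in flat]
    # Zero-probability cells contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def normalized_entropy(attn_map):
    """Entropy divided by its maximum, log(number of cells), giving [0, 1]."""
    n = sum(len(row) for row in attn_map)
    if n <= 1:
        return 0.0
    return attention_entropy(attn_map) / math.log(n)


if __name__ == "__main__":
    diffuse = [[0.25, 0.25], [0.25, 0.25]]   # spread evenly over all cells
    peaked = [[1.0, 0.0], [0.0, 0.0]]        # all mass on one cell
    print(normalized_entropy(diffuse), normalized_entropy(peaked))
```

Comparing this value for numeral tokens against noun tokens across many prompts would be one possible way to back the visual observation with a number.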

Insufficient evidence (50%) The heavily downsampled spatiotemporal latent space in DiT-based architectures limits the separability of individual object representations, making stable count control difficult.
This claim about downsampled latent space limiting separability is stated as an assertion without empirical demonstration. The paper does not conduct experiments to prove this specific limitation - no ablation on latent space resolution or comparison...

Verified (95%) On CountBench, NUMINA improves by 7.4% counting accuracy on Wan2.1-1.3B and by 5.5% on a larger 14B model.
Numbers are precisely verifiable in Table 1: Wan2.1-1.3B improves from 42.3% to 49.7% (a gain of 7.4 percentage points), and the 14B model improves from 53.6% to 59.1% (a gain of 5.5 percentage points). The arithmetic is correct.
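
The arithmetic above can be rechecked in a few lines. The sketch uses only the four accuracy values quoted from Table 1; the helper name `gains` is hypothetical. It also reports the relative improvement, which makes clear that the quoted 7.4 and 5.5 figures are absolute differences in accuracy rather than relative gains.

```python
def gains(baseline: float, improved: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Counting accuracy (%) before and after NUMINA, as quoted above.
results = {"Wan2.1-1.3B": (42.3, 49.7), "14B": (53.6, 59.1)}

if __name__ == "__main__":
    for name, (before, after) in results.items():
        abs_gain, rel_gain = gains(before, after)
        print(f"{name}: +{abs_gain:.1f} points, {rel_gain:.1f}% relative")
```

In relative terms the gains come to roughly 17.5% for the 1.3B model and 10.3% for the 14B model.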

Verified (85%) We observe a consistent increase in CLIP score for various baselines, suggesting that enforcing correct instance counts strengthens overall text-video alignment and yields cleaner scene layouts.
Table 1 shows consistent CLIP score increases across all model sizes: 1.3B (33.9→35.6), 5B (34.6→35.3), 14B (35.2→35.6). The interpretation about 'cleaner scene layouts' is plausible though not directly measured.

Verified (90%) In the first phase, NUMINA operates early during denoising to detect misalignment between numeral tokens and the evolving latent layout.
The methodology is fully specified in Section 4.1 with concrete parameters: timestep t*=20 and layer ℓ*=15 for early denoising attention extraction.

Verified (90%) NUMINA performs a dynamic selection of attention heads using an object discriminability criterion, then applies a cluster-based algorithm to obtain precise segmentation.
The head selection process with discriminability scoring is fully specified in paragraphs 20-21, and the clustering algorithm for segmentation is described in paragraph 26 with references to mean shift and density-based clustering.

Verified (90%) We utilize a two-phase pipeline for the training-free framework, following an identify-then-guide paradigm.
The two-phase pipeline is clearly described in Section 4 with Phase 1 (identification) in Section 4.1 and Phase 2 (guidance) in Section 4.2.

Verified (90%) We select the most instance-discriminative self-attention head and the most text-concentrated cross-attention head, and then fuse their maps to obtain an instance-level layout that is explicitly countable.
The selection criteria for self-attention (discriminability score) and cross-attention (peak activation) are fully specified in paragraphs 20-23, with the fusion process described in paragraphs 25-29.

Insufficient evidence (55%) We observe substantial head-wise diversity in spatial focus, category selectivity, and instance separability.
While Figure 4(a) is cited as visual evidence, this finding lacks quantitative measurement of 'head-wise diversity', 'spatial focus', 'category selectivity', or 'instance separability'. For a finding claim, visual observation alone without quantitati...

Verified (90%) At a reference timestep t⋆ during the pre-generation trajectory, we select attention heads from an intermediate layer ℓ⋆.
The design choice is fully specified with concrete parameters: t*=20 and ℓ*=15, as stated in paragraph 49.

Verified (90%) We design three complementary scores to measure the separability: 1) Foreground-background separation S1, 2) Structural richness S2, 3) Edge clarity S3.
All three scores are fully specified in paragraph 20 with clear definitions: S1 (standard deviation of intensities), S2 (variance across blocks), S3 (Sobel gradient magnitude).

Verified (90%) The overall discriminability score for head h is a weighted sum formed as: S(SA_h) = S1_h + S2_h + γ·S3_h, where γ > 0 balances the contribution of edge clarity against the global contrast and intermediate-scale structure.
The formula is explicitly stated in paragraphs 20-21, with the weighted-sum structure and the role of γ clearly explained.
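
A minimal sketch of the score S(SA_h) = S1_h + S2_h + γ·S3_h, following the verbal definitions quoted above (standard deviation of intensities, variance across blocks, Sobel gradient magnitude). Several details are assumptions rather than the paper's specification: S2 is taken as the variance of mean intensities over non-overlapping blocks, S3 as the mean Sobel magnitude over interior pixels, and the values `block=2` and `gamma=0.5` are placeholders. All function names are hypothetical.

```python
import math
from statistics import pstdev, pvariance

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def s1_contrast(m):
    """Foreground-background separation: std of all intensities."""
    return pstdev([v for row in m for v in row])


def s2_structure(m, block=2):
    """Structural richness: variance of block means (assumed definition)."""
    h, w = len(m), len(m[0])
    means = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            vals = [m[r + i][c + j] for i in range(block) for j in range(block)]
            means.append(sum(vals) / len(vals))
    return pvariance(means) if len(means) > 1 else 0.0


def s3_edges(m):
    """Edge clarity: mean Sobel gradient magnitude over interior pixels."""
    h, w = len(m), len(m[0])
    mags = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(SOBEL_X[i][j] * m[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * m[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            mags.append(math.hypot(gx, gy))
    return sum(mags) / len(mags) if mags else 0.0


def discriminability(m, gamma=0.5):
    """S = S1 + S2 + gamma * S3 for one self-attention map."""
    return s1_contrast(m) + s2_structure(m) + gamma * s3_edges(m)


def select_head(head_maps, gamma=0.5):
    """Index of the head whose map has the highest discriminability."""
    return max(range(len(head_maps)),
               key=lambda h: discriminability(head_maps[h], gamma))
```

A flat map scores zero on all three terms, so any head with visible structure is preferred over it; how the paper normalizes the three terms before summing is not stated in this summary.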

Verified (90%) For each target noun token T in the prompt, we select its best cross-attention head h*_c(T) = argmax_h C^h_T based on peak activation.
The selection criterion for cross-attention heads is fully specified in paragraph 22 with the argmax formula based on peak activation.
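
The argmax rule above is simple to express in code. In this sketch, C^h_T is assumed to be the maximum value of head h's cross-attention map for token T; the summary only says "peak activation", so that reading is an assumption, and the function names are hypothetical.

```python
def peak_activation(attn_map):
    """Largest value in a 2D cross-attention map (assumed meaning of C^h_T)."""
    return max(v for row in attn_map for v in row)


def select_cross_head(maps_per_head):
    """Pick the head with the highest peak activation for one token.

    maps_per_head: list of 2D maps for the same token T, indexed by head.
    Returns (best_head_index, its_peak_value).
    """
    scores = [peak_activation(m) for m in maps_per_head]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    heads = [
        [[0.1, 0.2], [0.1, 0.1]],   # weak, diffuse response
        [[0.0, 0.9], [0.0, 0.1]],   # sharply concentrated response
    ]
    print(select_cross_head(heads))
```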

Verified (90%) Spatial proposals are generated by partitioning the self-attention map A_s into contiguous regions using clustering.
The clustering-based partitioning is described in paragraph 26 with reference to mean shift clustering.
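
To make the idea of "contiguous regions" concrete, the sketch below splits a 2D map into 4-connected regions with a breadth-first flood fill over cells above a threshold. This is a deliberately simpler stand-in and not the mean shift clustering the paper refers to; the threshold value and the function name are hypothetical.

```python
from collections import deque


def contiguous_regions(attn_map, threshold):
    """Label 4-connected regions of cells whose value is >= threshold.

    Returns (labels, count): labels is a grid of ints where 0 means
    background and 1..count identify the regions.
    """
    h, w = len(attn_map), len(attn_map[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if attn_map[r][c] < threshold or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not labels[ny][nx]
                            and attn_map[ny][nx] >= threshold):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


if __name__ == "__main__":
    a_s = [
        [0.9, 0.8, 0.0, 0.0],
        [0.7, 0.0, 0.0, 0.6],
        [0.0, 0.0, 0.0, 0.9],
    ]
    labels, count = contiguous_regions(a_s, threshold=0.5)
    print(count, labels)
```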

Verified (90%) A_c,T is processed by suppressing values below a 0.1 peak-ratio threshold to isolate peak responses.
The 0.1 peak-ratio threshold is explicitly specified in paragraph 26 for suppressing low values in cross-attention maps.
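
A minimal sketch of the peak-ratio thresholding step. It assumes "suppressing" means setting a value to zero and that the cutoff is 0.1 times the map's maximum, with values exactly at the cutoff kept; the summary does not spell out either detail, and the function name is hypothetical.

```python
def suppress_below_peak_ratio(attn_map, ratio=0.1):
    """Zero out cells smaller than ratio * max(attn_map)."""
    peak = max(v for row in attn_map for v in row)
    cutoff = ratio * peak
    return [[v if v >= cutoff else 0.0 for v in row] for row in attn_map]


if __name__ == "__main__":
    a_c = [[1.0, 0.05], [0.2, 0.5]]
    # With a peak of 1.0 the cutoff is 0.1, so only 0.05 is suppressed.
    print(suppress_below_peak_ratio(a_c))
```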

Verified (90%) A region is retained as a valid instance if the semantic overlap score S_o ≥ τ.
The threshold τ for the semantic overlap score is described in paragraphs 27-28 for retaining valid instances.
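
The retention rule S_o ≥ τ can be sketched as a filter over candidate regions. The definition of S_o is not given in this summary, so the code assumes it is the fraction of a region's cells where the peak-thresholded cross-attention map is non-zero; that definition, the default `tau=0.5`, and the function names are all hypothetical.

```python
def semantic_overlap(region_mask, cross_map):
    """Fraction of region cells with non-zero cross-attention (assumed S_o)."""
    cells = [(r, c)
             for r, row in enumerate(region_mask)
             for c, inside in enumerate(row) if inside]
    if not cells:
        return 0.0
    hits = sum(1 for r, c in cells if cross_map[r][c] > 0)
    return hits / len(cells)


def retain_instances(region_masks, cross_map, tau=0.5):
    """Indices of regions whose overlap score reaches the threshold tau."""
    return [i for i, mask in enumerate(region_masks)
            if semantic_overlap(mask, cross_map) >= tau]


if __name__ == "__main__":
    cross = [[0.8, 0.6, 0.0], [0.0, 0.0, 0.0]]
    region_a = [[1, 1, 0], [0, 0, 0]]   # fully covered by the token's response
    region_b = [[0, 0, 1], [0, 0, 1]]   # no response from the token
    print(retain_instances([region_a, region_b], cross))
```

The number of retained regions is then what gets compared against the numeral in the prompt.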

... 53 claims in total

Reproducibility Assessment

Low reproducibility (0%)

Missing Reproduction Details

No code repository available - core algorithm implementation details for attention extraction and layout guidance modification are not provided
CountBench benchmark with 210 prompts is not publicly available
Random seeds for noise vector sampling not specified
Exact attention extraction methodology and normalization procedures not detailed
Threshold values for numerical misalignment identification not provided
GroundingDINO model version, configuration, and confidence thresholds for object detection not specified
CLIP model version used for evaluation not specified
Exact text prompts used for GroundingDINO category-specific detection not provided
Layer indexing convention for Wan T2V model not clarified (which layer corresponds to ℓ*=15)
Details on how scene layout is extracted from attention maps not fully specified

Limitations (Author-Stated)

While NUMINA significantly improves numerical alignment, achieving perfect accuracy across all scenarios remains challenging.
Generating very dense instances (e.g., tens or hundreds) remains unexplored.

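This analysis was generated automatically by the PDF Reading Assistant and is provided for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper's text and may be inaccurate. For the original paper, see arXiv.

Analysis time: 2026-04-19T07:13:26+00:00 · Data source: Paper Collector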