OmniShow introduces the first unified framework for Human-Object Interaction Video Generation, simultaneously conditioning on text, images, audio, and pose.
Core Question
How can we unify multiple multimodal conditions (text, reference images, audio, and pose sequences) in an end-to-end framework for high-quality Human-Object Interaction Video Generation?
Core Method
OmniShow is built on Waver 1.0, a 12B MMDiT-based latent diffusion model, and adds three key components. Unified Channel-wise Conditioning injects reference images and pose through channel concatenation. Gated Local-Context Attention synchronizes audio with human motion at a cost of only ~2.5% additional parameters. Decoupled-Then-Joint Training merges separately trained R2V and A2V models via weight interpolation (0.6 for A2V, 0.4 for R2V). The authors also establish HOIVG-Bench, a 135-sample benchmark for comprehensive evaluation.
Key components:
- Unified Channel-wise Conditioning injects reference images and pose without disrupting generative priors.
- Gated Local-Context Attention ensures precise synchronization between audio and human dynamics.
- Decoupled-Then-Joint Training effectively harnesses heterogeneous datasets for training.
- The merged model achieves zero-shot RA2V generation without explicit training.
- Videos successfully respect both reference images and audio inputs simultaneously.
- Preserving native input structure enables efficient transfer of pretrained I2V capability.
- Channel-wise conditioning minimizes the task adaptation gap compared to hybrid token approaches.
Claim Verification
The paper provides substantial evidence that OmniShow handles all four multimodal conditions (text, reference image, audio, pose) simultaneously. Table 1 shows quantitative results across R2V, RA2V, RP2V, and RAP2V settings. The literature review (p_
The paper provides ablation studies demonstrating the effectiveness of both techniques. Table 2a (p_55) shows Unified Channel-wise Conditioning yields superior video quality and reference consistency compared to token concatenation. The ablation also
The paper describes the multi-stage training strategy in detail (p_7, p_35-36) and provides comparative results in Table 2c (p_56). The ablation compares against 'Only RA2V' and 'R2V/A2V→RA2V' baselines, showing the proposed strategy achieves better
The paper provides detailed description of HOIVG-Bench construction in p_37-46, including sample curation criteria, processing pipeline, and evaluation metrics. The benchmark is used throughout the experimental section for systematic evaluation. This
The paper explicitly states the benchmark comprises 135 samples in p_37, and provides detailed description of the curation process (p_38-46) and metrics (p_47). This is a straightforward factual claim with clear evidence.
The paper explicitly states in p_15 that the framework is 'Built upon Waver 1.0 [74] (section 3.1), a powerful 12B MMDiT-based model' and lists four key components. This is a straightforward architectural claim with clear evidence.
The paper explicitly describes this encoding process in p_21 with the exact mathematical notation. This is a technical implementation detail that is fully specified.
The paper explicitly describes this loss configuration in p_24, including the initialization strategy, loss type, and weight value of 1. This is a fully specified design choice.
The paper describes the audio feature extraction pipeline in p_30, including the use of Wav2Vec 2.0 and merging representations from multiple layers. The approach is fully specified.
The paper explicitly states the sliding window size w=5 and stride s=4 in p_30. These are specific hyperparameter values that are fully documented.
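The windowing described above can be sketched as follows. This is a minimal illustration, assuming the window slides over a sequence of per-frame audio features and that only full windows are kept; the function name `sliding_windows` is hypothetical, and the paper's boundary handling (padding, centering) is not stated in this summary.

```python
from typing import List, Sequence


def sliding_windows(frames: Sequence, window: int = 5, stride: int = 4) -> List[list]:
    """Group per-frame audio features into overlapping windows.

    Defaults follow the reported values (w=5, s=4). Only full windows are
    returned; boundary handling is an assumption of this sketch.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    return [list(frames[i:i + window]) for i in range(0, last_start + 1, stride)]


# 13 frames yield windows starting at frames 0, 4, and 8.
windows = sliding_windows(list(range(13)))
```

With w=5 and s=4, consecutive windows share exactly one frame, so each window overlaps its neighbor at the boundary.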
The paper explicitly describes the gating vector initialization in p_32, including the dimension H and initialization value of 1e-5. This is a fully specified design choice.
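One way to read the near-zero gate initialization is that the audio branch starts out almost invisible to the pretrained model. The sketch below assumes the gate scales the audio-attention output element-wise before a residual addition, and treats H simply as the length of the gated vector; both are assumptions of this illustration, and `init_gate` and `gated_residual` are hypothetical names.

```python
from typing import List


def init_gate(h: int, value: float = 1e-5) -> List[float]:
    """Create a gating vector of length h filled with the reported initial value."""
    return [value] * h


def gated_residual(hidden: List[float], attn_out: List[float], gate: List[float]) -> List[float]:
    """Add the gated audio-attention output to the hidden state (assumed residual form)."""
    if not (len(hidden) == len(attn_out) == len(gate)):
        raise ValueError("hidden, attn_out, and gate must have the same length")
    return [x + g * a for x, a, g in zip(hidden, attn_out, gate)]


gate = init_gate(4)
hidden = [1.0, 2.0, 3.0, 4.0]
out = gated_residual(hidden, [10.0, 10.0, 10.0, 10.0], gate)
# At initialization the output is nearly identical to the hidden state.
```

Under this reading, the new attention path contributes almost nothing at the start of training, which would leave the pretrained behavior intact until the gate learns to open.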
The paper states in p_33 that audio attention is inserted only into dual-stream blocks, increasing model scale by ~2.5% to 12.3B total. The specific placement and parameter count are documented.
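The reported figures are mutually consistent, as a quick check shows. This sketch treats the rounded values "12B" and "~2.5%" as exact, which is an assumption; `total_params` is a hypothetical helper.

```python
def total_params(base: float, relative_increase: float) -> float:
    """Return the parameter count after a relative increase over the base model."""
    return base * (1.0 + relative_increase)


base = 12.0e9                      # Waver 1.0 backbone, as reported (rounded)
total = total_params(base, 0.025)  # about 12.3e9
added = total - base               # about 0.3e9 parameters for the audio attention
```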
The paper explicitly describes the model merging strategy in p_36, including inheriting audio modules from A2V model and linear interpolation weights of 0.6 and 0.4. This is a fully specified design choice.
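The merging rule can be illustrated with plain dictionaries of weights. This is a sketch under stated assumptions: parameters shared by both models are linearly interpolated with the reported weights, parameters found only in the A2V model (such as the audio modules) are copied over unchanged, and the R2V parameters are a subset of the A2V parameters. The function name and the parameter names in the example are hypothetical.

```python
from typing import Dict, List


def merge_weights(a2v: Dict[str, List[float]], r2v: Dict[str, List[float]],
                  w_a2v: float = 0.6, w_r2v: float = 0.4) -> Dict[str, List[float]]:
    """Interpolate shared parameters and inherit A2V-only parameters."""
    extra = set(r2v) - set(a2v)
    if extra:
        # Assumption of this sketch: R2V has no parameters that A2V lacks.
        raise ValueError(f"parameters only in R2V: {sorted(extra)}")
    merged = {}
    for name, a_vals in a2v.items():
        if name in r2v:
            r_vals = r2v[name]
            if len(a_vals) != len(r_vals):
                raise ValueError(f"shape mismatch for {name}")
            merged[name] = [w_a2v * a + w_r2v * r for a, r in zip(a_vals, r_vals)]
        else:
            merged[name] = list(a_vals)  # e.g. audio modules, taken from A2V
    return merged


a2v = {"block.w": [1.0, 2.0], "audio_attn.w": [5.0]}
r2v = {"block.w": [3.0, 4.0]}
merged = merge_weights(a2v, r2v)
```

Because the two weights sum to 1, each merged parameter stays within the range spanned by the two source values.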
The paper states in p_36 that 'pose is introduced only in the final fine-tuning stage to prevent overfitting.' The rationale and timing are explicitly documented.
The paper explicitly states in p_44 that DWPose is used for pose extraction. This is a straightforward implementation detail.
The paper describes the complete audio synthesis pipeline in p_45, including GPT-4o for script generation and attribute analysis, and ElevenLabs for audio synthesis. All components are specified.
The paper explicitly states in p_47 that all metrics are standardized to 5-second clips at 720p resolution in portrait mode. This is a clear specification of evaluation settings.
The paper explicitly states the optimizer settings in p_48, including AdamW optimizer, learning rate of 3×10^-5, and weight decay of 0.01. These are fully specified hyperparameters.
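To make the reported settings concrete, the sketch below applies one AdamW update to a single scalar parameter using the stated learning rate and weight decay. The values of `beta1`, `beta2`, and `eps` are common defaults and are assumptions here, since this summary does not report them; `adamw_step` is a hypothetical helper, not the authors' training code.

```python
import math


def adamw_step(param, grad, m, v, step, lr=3e-5, weight_decay=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter; returns (param, m, v)."""
    # Update biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias-correct the moments (step counts from 1).
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Decoupled weight decay, then the adaptive gradient step.
    param = param * (1 - lr * weight_decay)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


new_param, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, step=1)
```

With these settings the weight-decay term shrinks a parameter by a factor of 1 - 3e-7 per step, so the adaptive gradient term dominates each update.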
The paper provides comparative model sizes in p_49-50: OmniShow (12.3B), Hunyuan-Custom (13B), HuMo (17B/1.7B), VACE (14B), Phantom (14B/1.3B). Among 10B+ models, OmniShow at 12.3B is indeed the smallest. The claim is supported by the comparative data
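The size comparison can be checked directly from the figures listed above. In this sketch, the two-variant models are split into separate entries; the variant labels other than Phantom-14B are naming assumptions made for illustration, as is the helper `smallest_at_least`.

```python
# Parameter counts in billions, as listed in the summary above.
MODEL_SIZES_B = {
    "OmniShow": 12.3,
    "Hunyuan-Custom": 13.0,
    "HuMo-17B": 17.0,
    "HuMo-1.7B": 1.7,
    "VACE": 14.0,
    "Phantom-14B": 14.0,
    "Phantom-1.3B": 1.3,
}


def smallest_at_least(sizes: dict, threshold_b: float = 10.0) -> str:
    """Return the name of the smallest model with at least threshold_b billion parameters."""
    eligible = {name: size for name, size in sizes.items() if size >= threshold_b}
    if not eligible:
        raise ValueError("no model meets the threshold")
    return min(eligible, key=eligible.get)


smallest = smallest_at_least(MODEL_SIZES_B)  # "OmniShow"
```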
The paper states in p_50 that OmniShow 'matches the reference preservation capabilities of specialized methods like Phantom-14B, as evidenced by our comparable FaceSim and NexusScore.' Table 1 provides the quantitative comparison supporting this claim
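... 41 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing reproduction details
- Code unavailable: no official code repository was found
- Data unavailable: the training and evaluation datasets have not been released
- Number of training epochs/iterations not specified
- Batch size not stated
- Random seed not mentioned
- Dataset split (train/validation/test) not described in detail
- The names and sources of the datasets used are not made clear in the provided excerpts
- Preprocessing steps for feature extraction are not described in detail
- Implementation details of the HOIVG-Bench evaluation metrics are missing
- Whether the two training stages (480p and 720p) share the same hyperparameter settings is not stated
Limitations (as stated by the authors)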
- While our model is capable of generating longer videos (up to 10 seconds), our current evaluation focuses on 5-second clips to ensure a fair comparison with baselines that only support short-clip generation.
- The human reference images in our benchmark are AI-generated, which might introduce slight distribution biases compared to purely real-world photos.
- In some extreme scenarios involving overly intense motion or conflicting multimodal inputs, the model may occasionally exhibit artifacts or blur in generated videos.
- Reinforcement Learning (RL)-based post-training methods are worth fully exploring in this domain.
- We aim to scale up the training data and model capacity to push the boundaries of the model's generalization ability in complex scenarios.
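This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手) and is for reference only; it does not constitute an academic review. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may contain errors. For the original paper, see arXiv.
Analysis time: 2026-04-22T01:10:50+00:00 · Data source: Paper Collector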