CoInteract synthesizes physically consistent human-object interaction (HOI) videos using a Diffusion Transformer (DiT) with embedded structural priors. It introduces a Human-Aware Mixture-of-Experts for hand and face quality, and a dual-stream co-generation scheme that jointly trains RGB and HOI structure streams.
Core Problem
The paper investigates how to synthesize physically consistent human-object interaction videos that maintain structural stability and physical plausibility, addressing the limitations of RGB-centric diffusion models that lack 3D spatial understanding.
Core Method
Approach: The framework introduces a Human-Aware Mixture-of-Experts module that routes tokens to region-specialized experts for hands and faces using spatially-supervised routing. It proposes a dual-stream co-generation paradigm that jointly trains an RGB stream with an auxiliary HOI structure stream within a shared DiT backbone, using asymmetric co-attention to enable zero-overhead inference.
Key components:
- Video diffusion models have evolved to use DiT-style backbones for modeling spatiotemporal tokens with global attention.
- RGB-centric models remain fragile in HOI scenarios with weak constraints on contact geometry and body topology.
- Common failure modes include hand/face distortions and contact violations such as interpenetration.
- Recent multi-stream co-generation methods target general video synthesis but not HOI-specific challenges.
- CoInteract injects interaction-structure supervision and region-specific specialization into a shared DiT backbone.
- CoInteract is an end-to-end framework for speech-driven HOI video synthesis using dual reference images and motion frames.
- The framework synthesizes HOI videos that are structurally stable and physically plausible.
- Unlike conventional video diffusion models, CoInteract explicitly injects interaction structure into a shared DiT backbone.
Section IDs: sec_3, sec_6
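The asymmetric co-attention idea described above can be illustrated with a small sketch. This is not the paper's implementation: the token ordering (RGB tokens first, then HOI tokens), the single-head NumPy attention, and the names `asymmetric_mask` and `masked_attention` are all assumptions made for illustration. The only property taken from the text is that the RGB stream does not depend on the HOI branch, which is what would allow that branch to be dropped at inference.

```python
import numpy as np


def asymmetric_mask(n_rgb: int, n_hoi: int) -> np.ndarray:
    """Boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumed token order (hypothetical): [RGB tokens, HOI tokens].
    """
    n = n_rgb + n_hoi
    mask = np.ones((n, n), dtype=bool)
    # RGB queries are blocked from HOI keys; HOI queries still see everything.
    mask[:n_rgb, n_rgb:] = False
    return mask


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Toy check: RGB outputs are unchanged when the HOI tokens are removed.
rng = np.random.default_rng(0)
n_rgb, n_hoi, dim = 4, 3, 8
q = rng.normal(size=(n_rgb + n_hoi, dim))
k = rng.normal(size=(n_rgb + n_hoi, dim))
v = rng.normal(size=(n_rgb + n_hoi, dim))

with_hoi = masked_attention(q, k, v, asymmetric_mask(n_rgb, n_hoi))[:n_rgb]
rgb_only = masked_attention(
    q[:n_rgb], k[:n_rgb], v[:n_rgb], np.ones((n_rgb, n_rgb), dtype=bool)
)
print(np.allclose(with_hoi, rgb_only))
```

Under these assumptions the RGB rows give the same result whether or not the HOI tokens are present, so a model trained with such a mask could skip the HOI stream at inference without changing its RGB output. The same argument extends across stacked layers as long as no layer lets RGB tokens read HOI tokens.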
Claim Verification
The paper provides detailed architecture description (Section 3), quantitative results in Table 1 showing improvements across multiple metrics, qualitative results in Figure 5, and ablation studies in Table 2. The framework is fully specified and exp
The MoE architecture is fully described (p_28-31), and Table 2 ablation shows removing MoE causes drops in HQ (0.724→0.712) and FaceSim (0.696→0.682). The 'marginal increase in parameters' claim is quantified as 1.04× overhead in Table 2.
The dual-stream paradigm is fully described in p_16, and Table 2 ablation provides strong evidence: removing HOI stream causes VLM-QA to drop from 0.72 to 0.48 (-33.3%), demonstrating its importance for physical plausibility.
The framework is described in detail and validated experimentally. The claim about 'no external preprocessing or post-processing' refers to inference time, which is supported by the architecture where the HOI branch is removed at inference. However,
Same as claim_2 - MoE is described in detail and validated through ablation. The 'minimal additional parameters' is quantified as 1.04× overhead in Table 2.
The asymmetric co-attention mechanism is described in p_26-27. Table 2 provides strong evidence: full model achieves VLM-QA 0.72 at 1.00× cost, while retaining HOI branch achieves 0.76 at 4.13× cost. The asymmetric strategy enables zero-overhead infe
Design choice fully specified in Section 3 with detailed architecture description. The shared DiT backbone with dual streams and MoE is clearly described.
Design choice fully specified in p_16 with clear description of RGB stream z_r and HOI structure stream z_h jointly trained in single DiT backbone.
Design choice fully specified in p_16 and p_32, describing how human mesh projection and object mask fusion create the HOI structure stream.
Design choice clearly specified in p_17 - modality-specific patch embedding layers with same patch size feeding into shared DiT blocks.
Design choice specified in p_17 - shared transformer parameters with stream-specific modulation (scale and shift in adaptive layer normalization).
Design choice specified with mathematical formulation in p_18-20, showing joint flow-matching objective for both streams.
Simple hyperparameter setting stated clearly in p_20.
Design choice specified in p_21 with 3D RoPE encoding for token coordinates.
Design choice specified in p_22 with clear coordinate assignment scheme for dual streams.
Design choice specified in p_23 with clear temporal indexing scheme for motion frames.
Design choice specified in p_23 with specific temporal locations (t=30, 31) for reference images.
Design choice specified in p_26 with two-stage training and asymmetric co-attention mechanism.
Design choice specified in p_26 describing Stage 1 with bidirectional attention.
Design choice specified in p_26-27 with asymmetric attention mask formulation, enabling RGB independence from HOI branch.
... 51 claims in total
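The relative changes quoted in the ablation notes above can be re-derived from the reported values. The sketch below uses only the figures cited from Table 2 in this analysis; the helper name `relative_change` and the dictionary layout are illustrative choices, not part of the paper.

```python
def relative_change(full: float, variant: float) -> float:
    """Percentage change when moving from the full model to a variant."""
    return (variant - full) / full * 100.0


# (full model, variant) pairs as quoted from Table 2 in the notes above.
ablations = {
    "VLM-QA, HOI stream removed": (0.72, 0.48),
    "HQ, MoE removed": (0.724, 0.712),
    "FaceSim, MoE removed": (0.696, 0.682),
    "VLM-QA, HOI branch kept at inference": (0.72, 0.76),
}

for name, (full, variant) in ablations.items():
    change = relative_change(full, variant)
    print(f"{name}: {full} -> {variant} ({change:+.1f}%)")
```

The first entry reproduces the -33.3% drop cited for removing the HOI stream. The last entry puts the 4.13× inference cost of retaining the HOI branch next to the size of the VLM-QA gain it buys.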
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- Model architecture specifications (number of layers, hidden dimensions, attention configuration for DiT backbone)
- Training hyperparameters (learning rate, batch size, epochs, optimizer, learning rate schedule)
- Diffusion-specific parameters (number of diffusion steps, noise schedule, guidance scale, sampling method)
- Dataset information (training datasets used, dataset size, train/val/test splits, data preprocessing steps)
- Loss functions and their respective weights for multi-stream co-generation
- Hardware specifications (GPU type, memory requirements, training/inference time)
- Random seeds for reproducibility
- Evaluation metrics and their implementation details
- Baseline methods and comparison protocols
- Input/output specifications (resolution, frame rate, motion frame format)
Limitations (as stated by the authors)
- CoInteract instead faithfully preserves the reference scene, which trades marginal aesthetic scores for stronger consistency.
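This analysis was generated automatically by the PDF Reading Assistant (PDF 阅读助手) and is for reference only; it does not constitute an academic review. The verification conclusions and the reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-23T07:30:39+00:00 · Data source: Paper Collector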