HY-Embodied-0.5 is a family of embodied foundation models built on a Mixture-of-Transformers (MoT) architecture with visual latent tokens. Trained on more than 100M samples, MoT-2B averages 58.0% across 22 benchmarks (outperforming Qwen3-VL-4B by 10.2%), while MoE-A32B averages 67.0%, surpassing Gemini 3.0 Pro.
Core Question
How can we build vision-language foundation models specifically designed for real-world embodied agents that excel at fine-grained visual perception, spatial reasoning, and physical interaction tasks?
Core Method
Approach: The paper introduces HY-Embodied-0.5 in two variants: MoT-2B for edge deployment and MoE-A32B for complex tasks. Key architectural innovations include a native-resolution ViT, Mixture-of-Transformers for modality-adaptive computation, and visual latent tokens. Training uses 100M+ samples across perception, spatial, and embodied data, followed by iterative post-training with RL, rejection sampling fine-tuning, and large-to-small on-policy distillation.
Key components:
- The model architecture combines a vision encoder with a large language model, following the standard VLM paradigm.
- HY-ViT 2.0 provides native-resolution input support and accurate perception within a lightweight footprint.
- The Mixture-of-Transformers architecture introduces non-shared parameters for the vision branch to boost visual performance while preserving language capabilities.
- Visual latent tokens with specific supervision improve the model's overall perceptual capacity.
- The CoT mechanism enables systematic step-by-step analysis of spatial relationships and affordances in embodied reasoning tasks.
- The models exhibit advanced self-reflection and correction capabilities, explicitly pausing to reconsider structural details.
- Visual attention maps precisely localize salient objects and specific object parts relevant to the scene context.
- The visual latent tokens bridge the modality gap by aligning fine-grained visual features with linguistic concepts.
Section IDs: sec_3, sec_35
Claim Verification
The paper presents HY-Embodied-0.5 with comprehensive details including architecture (Section 2), data construction (Section 3), training pipeline (Section 4), and evaluation (Section 5). The model family is clearly described and substantiated throug
The paper explicitly describes both model variants with specific parameter counts: MoT-2B (2B activated/4B total) in p_4, p_8, and MoE-A32B (32B activated/407B total) in p_4, p_81. Both variants are evaluated with results reported.
The paper describes HY-ViT 2.0-400M in detail (p_8-p_9), including native-resolution support, distillation training, and 400M parameter count. The architecture is fully specified.
The MoT architecture is described in detail in p_10, including non-shared parameters for vision and language branches, bidirectional attention for visual tokens, and the rationale for modality-adaptive computation.
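The modality-adaptive computation described above can be illustrated with a toy layer in which each token is processed by the parameter set of its own modality. This is a minimal sketch, not the paper's implementation: `BranchParams`, `ToyMoTLayer`, and the scalar "feed-forward" are hypothetical stand-ins chosen only to show the routing idea.

```python
from dataclasses import dataclass


@dataclass
class BranchParams:
    """Hypothetical per-modality parameters (a real layer would hold weight matrices)."""
    scale: float
    bias: float


class ToyMoTLayer:
    """Toy layer with non-shared parameters for vision and text tokens."""

    def __init__(self, vision: BranchParams, text: BranchParams):
        self.params = {"vision": vision, "text": text}

    def forward(self, hidden, modalities):
        # Each token is transformed by the branch that matches its modality.
        out = []
        for h, m in zip(hidden, modalities):
            p = self.params[m]
            out.append(max(0.0, p.scale * h + p.bias))  # ReLU(scale * h + bias)
        return out


layer = ToyMoTLayer(vision=BranchParams(2.0, 0.0), text=BranchParams(1.0, 1.0))
result = layer.forward([1.0, 2.0, 3.0], ["text", "vision", "text"])
print(result)  # [2.0, 4.0, 4.0]
```

Because the two branches never share weights, updating the vision parameters leaves the text path untouched, which matches the stated goal of improving visual performance while preserving language capabilities.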
Visual latent tokens are described in p_8 and p_11, including their placement at the end of visual sequences and the global loss supervision during pre-training.
The paper provides specific dataset sizes: 62M Omni-Detection (p_15), 36M depth estimation (p_16), 5M segmentation (p_17), 11M pointing/counting (p_18), plus embodied and spatial data. The perception data alone exceeds 100M.
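The "exceeds 100M" statement can be checked by summing the four perception dataset sizes quoted above. The short sketch below only adds the reported figures; the dictionary keys are illustrative names, not identifiers from the paper.

```python
# Perception dataset sizes quoted above, in millions of samples.
perception_sizes_m = {
    "omni_detection": 62,
    "depth_estimation": 36,
    "segmentation": 5,
    "pointing_counting": 11,
}


def total_millions(sizes):
    """Sum the per-category sample counts (in millions)."""
    return sum(sizes.values())


total = total_millions(perception_sizes_m)
print(f"perception total: {total}M")  # 62 + 36 + 5 + 11 = 114
```

The embodied and spatial data mentioned in the same claim are not included here because their sizes are not quoted in this analysis.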
The iterative post-training paradigm is described in detail in p_61-p_64, including the alternation between RL and RFT stages, with clear explanation of how each component contributes.
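The alternation between RL and RFT stages can be sketched schematically. The following is a generic illustration of rejection sampling and stage alternation under stated assumptions: the stage order (RL first), the reward function, and the threshold are all hypothetical, since this analysis does not quote those details from the paper.

```python
def rejection_sample(candidates, reward_fn, threshold):
    """Keep only candidate responses whose reward meets the threshold."""
    return [c for c in candidates if reward_fn(c) >= threshold]


def stage_schedule(num_rounds):
    """Alternate RL and RFT stages; starting with RL is an assumption."""
    stages = []
    for _ in range(num_rounds):
        stages.extend(["RL", "RFT"])
    return stages


# Toy example: the "reward" is simply the response length (hypothetical).
candidates = ["ok", "a longer answer", "mid one"]
kept = rejection_sample(candidates, reward_fn=len, threshold=7)
print(kept)
print(stage_schedule(2))
```

In a real pipeline the kept responses would form the fine-tuning set for the RFT stage, and the resulting model would seed the next RL stage.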
The paper describes the on-policy distillation method in p_65-p_69 but does not provide ablation results comparing model performance with vs. without distillation. The claim of 'significantly improve' lacks quantitative evidence.
The paper lists all 22 benchmarks in p_72-p_73, organized into visual perception, spatial reasoning, and embodied understanding categories. Table 1 and Table 2 show results across all benchmarks.
The paper states in p_77 that the model achieves best performance on 16/22 benchmarks. Table 1 shows detailed results that can be verified against this claim.
Specific quantitative results are provided in p_6: 58.0% average score, outperforming Qwen3-VL-4B by 10.2% and RoboBrain2.5-4B by 8.6%. These can be verified from Table 1.
Specific quantitative results are provided in p_6 and p_81: MoE-A32B achieves 67.0% average, surpassing Gemini 3.0 Pro (63.6%).
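The margins in the two claims above can be cross-checked with simple arithmetic. This sketch assumes the reported margins are absolute percentage-point differences between average scores (the analysis does not say whether they are absolute or relative), so the implied baseline averages are derived values, not figures quoted from the paper.

```python
def margin(score, baseline):
    """Absolute gap in percentage points, rounded to one decimal place."""
    return round(score - baseline, 1)


def implied_baseline(score, reported_margin):
    """Baseline average implied by a reported absolute margin (assumption)."""
    return round(score - reported_margin, 1)


moe_vs_gemini = margin(67.0, 63.6)                # MoE-A32B vs Gemini 3.0 Pro
qwen_implied = implied_baseline(58.0, 10.2)       # implied Qwen3-VL-4B average
robobrain_implied = implied_baseline(58.0, 8.6)   # implied RoboBrain2.5-4B average

print(moe_vs_gemini, qwen_implied, robobrain_implied)
```

The implied averages should be compared against Table 1 of the paper before being relied on.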
The paper describes the 400M ViT and mentions it's 'optimized for edge-device deployment' but provides no quantitative efficiency metrics (inference speed, memory usage, latency) to substantiate the 'efficient' claim.
The MoT architecture with modality-adaptive computation is clearly described in p_10, including the mechanism of non-shared parameters for vision and language tokens.
The paper explicitly states in p_10: 'we design distinct attention mask patterns for visual and text tokens' and explains bidirectional attention for visual tokens.
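A mixed mask of this kind can be sketched as follows. The sketch assumes that all tokens follow a causal pattern by default and that visual tokens belonging to the same visual element additionally attend to each other in both directions. The function name, the modality labels, and the per-element `segment_ids` are illustrative assumptions, not details taken from the paper.

```python
def build_attention_mask(modalities, segment_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed rule: causal attention everywhere, plus bidirectional attention
    among vision tokens that share the same segment (visual element).
    """
    n = len(modalities)
    # Start from a causal (lower-triangular) mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_element = (
                modalities[i] == "vision"
                and modalities[j] == "vision"
                and segment_ids[i] == segment_ids[j]
            )
            if same_element:
                mask[i][j] = True  # let earlier visual tokens see later ones
    return mask


modalities = ["text", "vision", "vision", "vision", "text"]
segment_ids = [None, 0, 0, 0, None]
mask = build_attention_mask(modalities, segment_ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In the printed matrix, the rows for the three visual tokens are identical within the image block, while the text rows remain strictly causal.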
The visual next-code prediction task is described in p_10, including the use of discrete visual representations from a larger ViT as supervision.
The paper states in p_11: 'appending a learnable visual latent token to the end of each visual element' and describes the design.
The paper states in p_11: 'during the pre-training phase, we use the global features from a large ViT to supervise the output features of this token'.
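The two quoted design points (a latent token appended to each visual element, supervised by a large ViT's global feature) can be sketched as below. The token name `<vis_latent>` and the cosine-distance loss are assumptions made for illustration; the quoted text does not specify the loss form.

```python
import math

LATENT = "<vis_latent>"  # hypothetical placeholder name for the latent token


def append_latent_tokens(elements):
    """Append one latent token to the end of each visual element's token list."""
    return [tokens + [LATENT] for tokens in elements]


def cosine_distance(a, b):
    """1 - cosine similarity; used here as an assumed supervision loss."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


sequences = append_latent_tokens([["p0", "p1"], ["q0"]])
print(sequences)

# Toy features: latent-token output vs. global feature from a larger ViT.
loss = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(loss)
```

A loss near zero would mean the latent token's output already points in the same direction as the teacher's global feature.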
The paper states in p_9: 'we employ a 400M-parameter ViT model for HY-Embodied-0.5 and train it via distillation from a more powerful internal ViT'.
The paper states in p_15: 'We obtain 62M Omni-Detection data in total.'
... 61 claims in total
Reproducibility Assessment
Low reproducibility (0%)
Missing Reproduction Details
- No code available - implementation details not accessible
- No training data available - dataset composition and sources unknown
- Training hyperparameters missing (learning rate, batch size, epochs, optimizer settings)
- Model architecture specifics not provided (hidden dimensions, number of layers, attention heads, parameter counts)
- Hardware/environment specifications for training not documented
- Random seeds not specified for reproducibility
- Training data preprocessing steps not described
- Loss functions and training objectives not detailed
- Evaluation benchmark details and data splits not specified
- Mixture-of-Transformers implementation details unclear
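Limitations (as stated by the authors)
The paper does not explicitly list any limitations.
This analysis was generated automatically by PDF Reading Assistant (PDF 阅读助手). It is for reference only and does not constitute an academic review opinion. The verification conclusions and reproducibility assessment are based on automated analysis of the paper text and may be inaccurate. For the original paper, see arXiv.
Analysis time: 2026-04-19T01:09:57+00:00 · Data source: Paper Collector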