HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents - AI 论文深度分析

TL;DR
HY-Embodied-0.5 presents embodied foundation models using Mixture-of-Transformers architecture with visual latent tokens. Trained on 100M+ samples, MoT-2B achieves 58.0% on 22 benchmarks (outperforming Qwen3-VL-4B by 10.2%), while MoE-A32B reaches 67.0%, surpassing Gemini 3.0 Pro.

已证实

证据不足

无法验证

N/A

可复现性

置信度

77%

核心问题

How can we build vision-language foundation models specifically designed for real-world embodied agents that excel at fine-grained visual perception, spatial reasoning, and physical interaction tasks?

核心方法

{'approach': 'The paper introduces HY-Embodied-0.5 with two variants: MoT-2B for edge deployment and MoE-32B for complex tasks. Key architectural innovations include native-resolution ViT, Mixture-of-Transformers for modality-adaptive computation, and visual latent tokens. Training involves 100M+ samples across perception, spatial, and embodied data, followed by iterative post-training with RL, rejection sampling fine-tuning, and large-to-small on-policy distillation.', 'key_components': ['The model architecture combines a vision encoder with a large language model following the standard VLM paradigm.', 'HY-ViT 2.0 provides native-resolution input support and accurate perception within a lightweight footprint.', 'Mixture-of-Transformers architecture introduces non-shared parameters for vision branch to boost visual performance while preserving language capabilities.', "Visual latent tokens with specific supervision improve the model's overall perceptual capacity.", 'The CoT mechanism enables systematic step-by-step analysis of spatial relationships and affordances in embodied reasoning tasks.', 'Models exhibit advanced self-reflection and correction capabilities, explicitly pausing to reconsider structural details.', 'Visual attention maps precisely localize salient objects and specific object parts relevant to scene context.', 'The visual latent tokens effectively bridge the modality gap by aligning fine-grained visual features with linguistic concepts.'], 'section_ids': ['sec_3', 'sec_35']}

论点验证

已证实 (95%) In this report, we present HY-Embodied-0.5, a family of foundation models purpose-built for real-world agents.
The paper presents HY-Embodied-0.5 with comprehensive details including architecture (Section 2), data construction (Section 3), training pipeline (Section 4), and evaluation (Section 5). The model family is clearly described and substantiated throug

已证实 (95%) The HY-Embodied-0.5 family instantiates two primary variants: a highly efficient multimodal Mixture-of-Transformers (MoT) model (2B activated / 4B total parameters) optimized for real-time responsiveness and edge deployment, and a powerful Mixture-of-Experts (MoE) model (32B activated / 407B total parameters) engineered to tackle complex visual perception and embodied reasoning tasks.
The paper explicitly describes both model variants with specific parameter counts: MoT-2B (2B activated/4B total) in p_4, p_8, and MoE-A32B (32B activated/407B total) in p_4, p_81. Both variants are evaluated with results reported.

已证实 (90%) we introduce a lightweight yet powerful native-resolution Vision Transformer (ViT) for visual encoding
The paper describes HY-ViT 2.0-400M in detail (p_8-p_9), including native-resolution support, distillation training, and 400M parameter count. The architecture is fully specified.

已证实 (90%) a Mixture-of-Transformers architecture to enable modality-adaptive computation and improve the model's visual modeling capacity
The MoT architecture is described in detail in p_10, including non-shared parameters for vision and language branches, bidirectional attention for visual tokens, and the rationale for modality-adaptive computation.

已证实 (90%) incorporate visual latent tokens to better connect vision and language
Visual latent tokens are described in p_8 and p_11, including their placement at the end of visual sequences and the global loss supervision during pre-training.

已证实 (85%) we build high-quality perception and embodied pre-training data of over 100M training samples, covering basic perception, spatial perception, embodied perception, and reasoning and planning
The paper provides specific dataset sizes: 62M Omni-Detection (p_15), 36M depth estimation (p_16), 5M segmentation (p_17), 11M pointing/counting (p_18), plus embodied and spatial data. The perception data alone exceeds 100M.

已证实 (90%) we design an iterative, self-evolving post-training paradigm
The iterative post-training paradigm is described in detail in p_61-p_64, including the alternation between RL and RFT stages, with clear explanation of how each component contributes.

证据不足 (50%) through a large-to-small on-policy distillation approach to transfer knowledge from the large model to the small model, we significantly improve the performance of the edge variant of our model
The paper describes the on-policy distillation method in p_65-p_69 but does not provide ablation results comparing model performance with vs. without distillation. The claim of 'significantly improve' lacks quantitative evidence.

已证实 (95%) we construct an evaluation suite comprising 22 public benchmarks, covering visual perception, spatial reasoning, and embodied understanding
The paper lists all 22 benchmarks in p_72-p_73, organized into visual perception, spatial reasoning, and embodied understanding categories. Table 1 and Table 2 show results across all benchmarks.

已证实 (90%) Our HY-Embodied-0.5-MoT-2B achieves the best performance on 16 out of 22 benchmarks among compared generalist and specialist embodied VLMs of similar sizes.
The paper states in p_77 that the model achieves best performance on 16/22 benchmarks. Table 1 shows detailed results that can be verified against this claim.

已证实 (90%) It achieves an average score of 58.0% across all 22 benchmarks, outperforming the generalist VLM Qwen3-VL-4B (Bai et al., 2025) and the specialist embodied VLM RoboBrain2.5-4B (Tan et al., 2026)-both of which have larger activated parameters-by 10.2% and 8.6%, respectively.
Specific quantitative results are provided in p_6: 58.0% average score, outperforming Qwen3-VL-4B by 10.2% and RoboBrain2.5-4B by 8.6%. These can be verified from Table 1.

已证实 (90%) Our most powerful HY-Embodied-0.5-MoE-A32B model achieves an average score of 67.0% across the 22 benchmarks, surpassing the frontier model Gemini 3.0 Pro (63.6%)
Specific quantitative results are provided in p_6 and p_81: MoE-A32B achieves 67.0% average, surpassing Gemini 3.0 Pro (63.6%).

证据不足 (50%) we train an efficient yet powerful native-resolution Vision Transformer (ViT) optimized for edge-device deployment
The paper describes the 400M ViT and mentions it's 'optimized for edge-device deployment' but provides no quantitative efficiency metrics (inference speed, memory usage, latency) to substantiate the 'efficient' claim.

已证实 (90%) we adopt a Mixture-of-Transformers architecture to enable modality-adaptive computation
The MoT architecture with modality-adaptive computation is clearly described in p_10, including the mechanism of non-shared parameters for vision and language tokens.

已证实 (90%) we design distinct attention mask patterns for visual and text tokens
The paper explicitly states in p_10: 'we design distinct attention mask patterns for visual and text tokens' and explains bidirectional attention for visual tokens.

已证实 (90%) we introduce a visual next-code prediction task to better optimize the vision branch in the MoT and provide stronger supervision signals
The visual next-code prediction task is described in p_10, including the use of discrete visual representations from a larger ViT as supervision.

已证实 (90%) we append a learnable visual latent token to the end of each visual element (e.g., an image or a video frame)
The paper states in p_11: 'appending a learnable visual latent token to the end of each visual element' and describes the design.

已证实 (90%) during the pre-training phase, we use the global features from a large ViT to supervise the output features of this token
The paper states in p_11: 'during the pre-training phase, we use the global features from a large ViT to supervise the output features of this token'.

已证实 (90%) we employ a 400M-parameter ViT model for HY-Embodied-0.5 and train it via distillation from a more powerful internal ViT
The paper states in p_9: 'we employ a 400M-parameter ViT model for HY-Embodied-0.5 and train it via distillation from a more powerful internal ViT'.

已证实 (90%) We obtain 62M Omni-Detection data in total
The paper states in p_15: 'We obtain 62M Omni-Detection data in total.'

... 共 61 个论点

可复现性评估

较低可复现性 (0%)

缺失的复现细节

No code available - implementation details not accessible
No training data available - dataset composition and sources unknown
Training hyperparameters missing (learning rate, batch size, epochs, optimizer settings)
Model architecture specifics not provided (hidden dimensions, number of layers, attention heads, parameter counts)
Hardware/environment specifications for training not documented
Random seeds not specified for reproducibility
Training data preprocessing steps not described
Loss functions and training objectives not detailed
Evaluation benchmark details and data splits not specified
Mixture-of-Transformers implementation details unclear

局限性（作者自述）

论文中未明确列出局限性。

本分析由 PDF 阅读助手自动生成，仅供参考，不构成学术评审意见。验证结论和可复现性评估基于论文文本自动分析，可能存在偏差。原始论文请参阅 arXiv。

分析时间：2026-04-19T01:09:57+00:00 · 数据来源：Paper Collector