HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of general-purpose vision-language models in meeting the demands of real-world embodied agents, particularly in spatial reasoning, temporal understanding, and interactive planning, by introducing HY-Embodied-0.5, a family of foundation models tailored for embodied intelligence. HY-Embodied-0.5 adopts a Mixture-of-Transformers (MoT) architecture for modality-specific computation and combines it with latent tokens, an iterative self-evolving post-training paradigm, and on-policy distillation from the large model to the small one, improving performance while keeping the compact variant suitable for edge deployment. Across 22 benchmarks, the MoT-2B model (2B activated parameters) outperforms similarly sized state-of-the-art models on 16, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. A Vision-Language-Action (VLA) model trained on this foundation achieves compelling results in real-world robot control evaluations.

📝 Abstract

We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
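The modality-specific computing mentioned in the abstract can be illustrated with a toy sketch. The abstract does not say which layers are modality-specific or how tokens are tagged, so everything below (the `experts` table, the `"text"`/`"vision"` labels, the 2-d weights) is a hypothetical illustration of the general Mixture-of-Transformers idea, not the HY-Embodied-0.5 implementation: tokens from all modalities sit in one sequence, but each token is transformed by the weights owned by its modality.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix (one row per output) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def modality_specific_layer(
    tokens: List[Vector],
    modalities: List[str],
    experts: Dict[str, Matrix],
) -> List[Vector]:
    """Apply to each token the weights that belong to its modality.

    Assumption: routing is by a fixed modality label per token, not learned.
    """
    return [linear(experts[m], t) for t, m in zip(tokens, modalities)]


# Hypothetical per-modality weights, chosen only to make the effect visible.
experts = {
    "text": [[1.0, 0.0], [0.0, 1.0]],    # identity
    "vision": [[2.0, 0.0], [0.0, 2.0]],  # scale by 2
}
tokens = [[1.0, 2.0], [3.0, 4.0]]
modalities = ["text", "vision"]

outputs = modality_specific_layer(tokens, modalities, experts)
print(outputs)  # [[1.0, 2.0], [6.0, 8.0]]
```

In a real model the per-modality weights would be learned and the tokens would still interact through attention over the joint sequence; the sketch only shows the routing step.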

Problem

Research questions and friction points this paper is trying to address.

embodied intelligence

vision-language models

spatial reasoning

temporal perception

real-world agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Transformers

embodied reasoning

on-policy distillation
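As a rough illustration of on-policy distillation, the sketch below lets a student sample its own sequence and then penalizes, at each step, the divergence between the student's and the teacher's next-token distributions on that student-generated prefix. The page does not specify the loss used by HY-Embodied-0.5; the reverse KL divergence here is one common choice and is an assumption, as are the toy policies and all function names.

```python
import math
import random
from typing import Callable, List, Sequence

Dist = List[float]
Policy = Callable[[Sequence[int]], Dist]


def reverse_kl(student: Dist, teacher: Dist) -> float:
    """KL(student || teacher) over one next-token distribution."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)


def on_policy_distill_loss(
    student: Policy, teacher: Policy, length: int, rng: random.Random
) -> float:
    """Average per-token reverse KL along a sequence sampled from the student."""
    prefix: List[int] = []
    total = 0.0
    for _ in range(length):
        s = student(prefix)
        t = teacher(prefix)  # teacher scores the student's own prefix
        total += reverse_kl(s, t)
        # On-policy: the next token comes from the student, not the teacher.
        prefix.append(rng.choices(range(len(s)), weights=s)[0])
    return total / length


def toy_student(prefix: Sequence[int]) -> Dist:
    """Hypothetical student: always uniform over two tokens."""
    return [0.5, 0.5]


def toy_teacher(prefix: Sequence[int]) -> Dist:
    """Hypothetical teacher: tends to repeat the previous token."""
    if not prefix or prefix[-1] == 0:
        return [0.9, 0.1]
    return [0.2, 0.8]


loss = on_policy_distill_loss(toy_student, toy_teacher, 4, random.Random(0))
self_loss = on_policy_distill_loss(toy_student, toy_student, 4, random.Random(0))
print(loss > 0.0, self_loss)
```

In training, this quantity would be minimized with respect to the student's parameters; the loss is zero only when the student already matches the teacher on the prefixes it visits.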