🤖 AI Summary
This paper addresses the loose coupling between multimodal large language models (MLLMs) and world models (WMs), as well as the misalignment between semantic intent and dynamic state representations. To this end, the authors propose BiTAgent, a task-aware bidirectional coupling framework. Methodologically, they design a forward semantic injection mechanism and a backward reward feedback mechanism to bridge the semantic and state spaces; introduce an encoder–world model–policy decoder architecture that enables dynamic joint learning and behavior adaptation; and incorporate dense text-conditioned rewards to achieve cross-modal alignment. Experiments demonstrate that the framework significantly outperforms existing methods in multi-task learning and cross-environment generalization, with superior stability. It establishes a novel paradigm for semantic-driven embodied agent modeling in open-world scenarios.
📝 Abstract
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these challenges, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
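To make the bidirectional coupling concrete, here is a minimal toy sketch of the two pathways the abstract describes: a forward path that injects a semantic embedding into a latent state before imagination, and a backward path that scores imagined states with a dense, text-conditioned reward. All names (`encode_task`, `forward_inject`, `dense_text_reward`, `wm_step`), the toy dynamics, the blending coefficient, and the distance-based reward are illustrative assumptions, not the paper's actual architecture or loss.

```python
import random

DIM = 8  # toy latent dimensionality (illustrative, not from the paper)

def encode_task(text):
    # Stand-in for the MLLM encoder: deterministically map an
    # instruction to a toy semantic embedding.
    rng = random.Random(hash(text) % (2**32))
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def forward_inject(semantic, wm_state, alpha=0.5):
    # Forward path: blend the MLLM's semantic embedding into the
    # WM latent state so imagined rollouts are task-conditioned.
    return [(1 - alpha) * s + alpha * m for s, m in zip(wm_state, semantic)]

def wm_step(state, action):
    # Placeholder world-model transition (toy linear dynamics).
    return [s + 0.1 * action for s in state]

def dense_text_reward(predicted_state, semantic):
    # Backward path: a dense, text-conditioned reward; here simply
    # negative squared distance between imagined state and the goal
    # embedding, giving feedback at every imagined step.
    return -sum((p - m) ** 2 for p, m in zip(predicted_state, semantic))

# One imagination-and-selection cycle over two candidate actions.
semantic = encode_task("pick up the red block")
state = forward_inject(semantic, [0.0] * DIM)
rewards = {a: dense_text_reward(wm_step(state, a), semantic)
           for a in (-1.0, 1.0)}
best_action = max(rewards, key=rewards.get)
```

In the full framework this feedback would flow back as a training signal that refines the MLLM's semantic space; the sketch only shows the reward computation itself.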