BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the loose coupling between multimodal large language models (MLLMs) and world models (WMs), and the resulting misalignment between semantic intent and dynamic state representations. The authors propose BiTAgent, a task-aware bidirectional coupling framework. Methodologically, they design a forward semantic-injection mechanism and a backward reward-feedback mechanism to bridge the semantic and state spaces; introduce an encoder–world model–policy decoder architecture that enables dynamic joint learning and behavior adaptation; and incorporate dense text-conditioned rewards to achieve cross-modal alignment. Experiments show that the framework significantly outperforms existing methods in multi-task learning and cross-environment generalization, with superior stability. It establishes a new paradigm for semantically driven embodied-agent modeling in open-world scenarios.

📝 Abstract
Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
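The two pathways described in the abstract can be sketched as a toy example: a forward path that injects a task embedding into the world model's latent state, and a backward path that scores imagined states with a dense text-conditioned reward. Everything here is an illustrative assumption, not the paper's implementation: the dimensions, the linear maps standing in for learned networks, and the cosine-similarity reward are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_SEM, D_LAT, D_ACT = 8, 6, 4            # toy dimensions (illustrative only)

# Hypothetical linear maps standing in for learned networks.
W_inject   = 0.1 * rng.normal(size=(D_SEM, D_LAT))  # forward path: semantics -> WM latent
W_dynamics = 0.1 * rng.normal(size=(D_LAT, D_LAT))  # world-model latent transition
W_policy   = 0.1 * rng.normal(size=(D_LAT, D_ACT))  # policy decoder: latent -> action

def forward_inject(task_emb, z):
    """Forward path: add the projected MLLM task embedding to the WM latent."""
    return z + task_emb @ W_inject

def dense_reward(z, task_emb):
    """Backward path: a dense text-conditioned reward, here an assumed
    cosine similarity between the latent state and the projected task
    embedding (a stand-in for whatever reward the paper actually uses)."""
    g = task_emb @ W_inject
    return float(g @ z / (np.linalg.norm(g) * np.linalg.norm(z) + 1e-8))

# Semantically guided imagination: inject once, then roll the WM forward,
# decoding actions and collecting text-conditioned rewards along the way.
task_emb = rng.normal(size=D_SEM)        # stand-in for an MLLM goal embedding
z = forward_inject(task_emb, rng.normal(size=D_LAT))

rewards, actions = [], []
for _ in range(5):
    actions.append(np.tanh(z @ W_policy))   # decode an action from the latent
    z = np.tanh(z @ W_dynamics)             # imagined next latent state
    rewards.append(dense_reward(z, task_emb))
```

In a real system the rewards would flow back as a training signal that refines the semantic representations, closing the loop the paper calls MLLM-WM joint optimization; this sketch only shows the data flow of one imagined rollout.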
Problem

Research questions and friction points this paper is trying to address.

Establish bidirectional coupling between MLLMs and world models for semantic-dynamic alignment.
Achieve task-aware adaptability for multi-task learning and cross-environment generalization.
Harmonize semantic reasoning with dynamic prediction in embodied agents.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional coupling between MLLMs and world models
Forward path injects MLLM representations for guided imagination
Backward path uses WM feedback to refine semantic space