Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Vision-Language-Action (VLA) models face two key challenges: predicting high-dimensional visual states dilutes model capacity and incurs prohibitive training costs, while compressing visual states into compact signals introduces an information bottleneck; in addition, neglecting language supervision undermines comprehension and reasoning. This paper proposes Mantis, a framework with Disentangled Visual Foresight (DVF) that separates visual state prediction from action decision-making via a meta-query mechanism and a diffusion Transformer (DiT) head. Because the current visual state reaches the DiT head through a residual connection, a simple next-state prediction objective lets the meta queries capture the latent actions that delineate the visual trajectory, and multimodal pretraining with explicit language supervision preserves language-guided reasoning. After fine-tuning, Mantis achieves a 96.7% task success rate on the LIBERO benchmark, surpassing strong baselines such as π₀.₅, and demonstrates superior instruction following, generalization to unseen instructions, and reasoning ability, while converging significantly faster.

📝 Abstract
Recent advances in Vision-Language-Action (VLA) models demonstrate that visual signals can effectively complement sparse action supervisions. However, letting VLA directly predict high-dimensional visual states can distribute model capacity and incur prohibitive training cost, while compressing visual states into more compact supervisory signals inevitably incurs information bottlenecks. Moreover, existing methods often suffer from poor comprehension and reasoning capabilities due to the neglect of language supervision. This paper introduces Mantis, a novel framework featuring a Disentangled Visual Foresight (DVF) to tackle these issues. Specifically, Mantis decouples visual foresight prediction from the backbone with the combination of meta queries and a diffusion Transformer (DiT) head. With the current visual state provided to the DiT via a residual connection, a simple next-state prediction objective enables the meta queries to automatically capture the latent actions that delineate the visual trajectory, and hence boost the learning of explicit actions. The disentanglement reduces the burden of the VLA backbone, enabling it to maintain comprehension and reasoning capabilities through language supervision. Empirically, pretrained on human manipulation videos, robot demonstrations, and image-text pairs, Mantis achieves a 96.7% success rate on LIBERO benchmark after fine-tuning, surpassing powerful baselines while exhibiting high convergence speed. Real-world evaluations show that Mantis outperforms $π_{0.5}$, a leading open-source VLA model, particularly in instruction-following capability, generalization to unseen instructions, and reasoning ability. Code and weights are released to support the open-source community.
Problem

Research questions and friction points this paper is trying to address.

High-dimensional visual state prediction dilutes VLA model capacity and incurs prohibitive training costs
Compressing visual states into compact supervisory signals creates information bottlenecks
Neglecting language supervision degrades comprehension and reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Visual Foresight with meta queries
Diffusion Transformer head for next-state prediction
Residual connection supplies the current visual state to the DiT head, so training supervises only the state change
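The interplay of these three pieces can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the dimensions, the linear stand-in for the DiT head, and all variable names (`meta_queries`, `dit_head`, `W_q`) are illustrative assumptions. It only shows the structural idea that, with the current state added back through a residual connection, the next-state objective supervises just the residual, which is the "latent action" the meta queries must encode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions, not taken from the paper.
STATE_DIM = 16    # flattened latent visual state
NUM_QUERIES = 4   # learnable meta queries
QUERY_DIM = 8

# Random stand-ins for trained parameters.
meta_queries = rng.normal(size=(NUM_QUERIES, QUERY_DIM))
W_q = rng.normal(size=(NUM_QUERIES * QUERY_DIM, STATE_DIM)) * 0.1

def dit_head(current_state, queries):
    """Toy stand-in for the diffusion Transformer (DiT) head.

    The head maps the meta queries to a predicted state *change*;
    the residual connection adds the current state back, so the head
    never has to reconstruct static visual content.
    """
    residual = queries.reshape(-1) @ W_q   # latent action -> state change
    return current_state + residual        # residual connection

s_t = rng.normal(size=STATE_DIM)           # current visual state
s_next_pred = dit_head(s_t, meta_queries)

# Training would minimize ||s_next_pred - s_{t+1}||^2; since s_t is
# added back verbatim, only the predicted residual carries gradient
# signal, i.e. the inter-frame "motion" the queries must capture.
delta = s_next_pred - s_t
print(delta.shape)  # (16,)
```

The VLA backbone in this scheme only has to produce useful meta queries, leaving pixel-level generation to the separate head; that is the disentanglement that frees backbone capacity for language supervision.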