Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

📅 2025-11-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In language-conditioned behavioral cloning, existing remedies for compounding errors still leave physical discontinuities in the action sequence and a misalignment between language semantics and physical execution. To address this, we propose CCoL, a continuous vision-language-action co-learning framework with semantic-physical alignment. The method co-learns over language instructions, visual observations, and proprioceptive signals to produce smooth, temporally consistent actions, and it uses bidirectional cross-attention to anchor language semantics to visuomotor representations for fine-grained semantic grounding. Across three simulation suites, it achieves an average relative improvement of 8.0%, with up to a 19.2% relative gain on human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further indicate generalization to unseen and noisy object states. The core contribution is the integration of semantic-physical alignment, via bidirectional cross-attention, into behavioral cloning to improve policy robustness and execution accuracy.

📝 Abstract

Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations through bidirectional cross-attention, which learns contextual information for action generation and addresses the problem of semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to a 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
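
To make the bidirectional cross-attention idea concrete, here is a minimal sketch in plain Python. It is not the CCoL implementation: it assumes a single attention head, omits the learned query/key/value projections, and uses a simple residual sum to fuse the attended context. The names `cross_attend` and `bidirectional_cross_attention` and the token dimensions are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    Each query token attends over all tokens in keys_values, which serve
    as both keys and values (an assumption made to keep the sketch short).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def bidirectional_cross_attention(lang_tokens, visuomotor_tokens):
    """Run attention in both directions and fuse each result residually."""
    # Language tokens query the visuomotor tokens (semantic -> physical).
    lang_ctx = cross_attend(lang_tokens, visuomotor_tokens)
    # Visuomotor tokens query the language tokens (physical -> semantic).
    vm_ctx = cross_attend(visuomotor_tokens, lang_tokens)
    lang_out = [[x + c for x, c in zip(tok, ctx)]
                for tok, ctx in zip(lang_tokens, lang_ctx)]
    vm_out = [[x + c for x, c in zip(tok, ctx)]
              for tok, ctx in zip(visuomotor_tokens, vm_ctx)]
    return lang_out, vm_out


# Hypothetical inputs: 4 language tokens and 6 visuomotor tokens, 8 dims each.
rng = random.Random(0)
lang = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
visuomotor = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(6)]
lang_out, vm_out = bidirectional_cross_attention(lang, visuomotor)
print(len(lang_out), len(lang_out[0]), len(vm_out), len(vm_out[0]))
```

Each modality keeps its own token count and gains context from the other, which is the property the abstract relies on when it describes anchoring language semantics to visuomotor representations.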

Problem

Research questions and friction points this paper is trying to address.

Overcoming compounding errors in sequential action decisions for behavioral cloning

Addressing physical discontinuities and semantic-physical misalignment in robot control

Improving action cloning accuracy and execution consistency in human-robot interaction
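
The compounding-error problem named above can be illustrated with a toy model. The sketch below is an assumption for intuition only and does not come from the paper: it treats the cloned policy as adding a fixed `per_step_error` to the deviation from the demonstrated trajectory at every step, with an optional `drift_gain` that amplifies the existing deviation as the policy leaves the states it was trained on. Both parameter names are hypothetical.

```python
def rollout_deviation(per_step_error, horizon, drift_gain=0.0):
    """Return the deviation from the expert trajectory after each step.

    deviation[t+1] = (1 + drift_gain) * deviation[t] + per_step_error
    With drift_gain = 0 the deviation grows linearly in the horizon;
    with drift_gain > 0 it grows geometrically.
    """
    deviation = 0.0
    history = []
    for _ in range(horizon):
        deviation = (1.0 + drift_gain) * deviation + per_step_error
        history.append(deviation)
    return history


# Hypothetical numbers: a 1% action error per step over 100 steps.
linear = rollout_deviation(0.01, 100)
amplified = rollout_deviation(0.01, 100, drift_gain=0.05)
print(round(linear[-1], 3), round(amplified[-1], 3))
```

Even a small per-step error becomes a large end-of-episode deviation, which is why methods in this area work on sequence-level consistency and not only on per-step accuracy.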

Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous co-learning across vision, language, and proprioceptive inputs

Bidirectional cross-attention for semantic-physical alignment

Generating robust and smooth action execution trajectories
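
One simple way to quantify how smooth an action trajectory is uses the mean squared second difference of consecutive actions. This is an illustrative metric chosen here as an assumption, not the measure reported in the paper, and the function name `mean_squared_second_difference` is hypothetical.

```python
def mean_squared_second_difference(actions):
    """Average squared discrete acceleration over a trajectory.

    actions is a list of per-step action vectors of equal length.
    A value of 0 means every dimension changes at a constant rate;
    abrupt jumps between steps raise the value.
    """
    if len(actions) < 3:
        raise ValueError("need at least three steps")
    total = 0.0
    count = 0
    for prev, cur, nxt in zip(actions, actions[1:], actions[2:]):
        for p, c, n in zip(prev, cur, nxt):
            total += (n - 2.0 * c + p) ** 2
            count += 1
    return total / count


# A constant-velocity ramp versus a trajectory with a sudden jump.
ramp = [[0.0], [1.0], [2.0], [3.0]]
jump = [[0.0], [0.0], [1.0], [1.0]]
print(mean_squared_second_difference(ramp), mean_squared_second_difference(jump))
```

A discontinuous trajectory scores higher than a ramp covering a larger distance, so the metric isolates abruptness from overall motion.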

🔎 Similar Papers

Never-Ending Behavior-Cloning Agent for Robotic Manipulation