Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

📅 2025-11-18
🤖 AI Summary
In language-conditioned behavioral cloning, compounding errors remain a central challenge, and existing remedies still produce physically discontinuous action sequences and semantic-physical misalignment. To address this, we propose a vision-language-action continuous co-learning framework with a semantic-physical bidirectional alignment mechanism. Our method uses bidirectional cross-attention to anchor language instructions to visual observations and proprioceptive signals, which supports temporally consistent, smooth action generation and fine-grained semantic grounding. Evaluated on three simulation suites, our approach achieves an average relative improvement of 8.0%, with up to a 19.2% relative gain on human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further show generalization under unseen and noisy object states. The core contribution is the integration of semantic-physical bidirectional alignment into embodied behavioral cloning to improve policy robustness and execution accuracy.

📝 Abstract
Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous co-learning across vision, language, and proprioceptive inputs (e.g., robot internal states). Meanwhile, we anchor language semantics to visuomotor representations through bidirectional cross-attention, which learns contextual information for action generation and addresses semantic-physical misalignment. Extensive experiments show that CCoL achieves an average 8.0% relative improvement across three simulation suites, with up to 19.2% relative gain in human-demonstrated bimanual insertion tasks. Real-world tests on a 7-DoF robot further confirm CCoL's generalization under unseen and noisy object states.
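The reported gains are relative, not absolute, so a short sketch of the arithmetic may help when reading them. The helper name and the success rates below are hypothetical placeholders chosen only to illustrate the formula; they are not numbers from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain in percent: (improved - baseline) / baseline * 100."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) * 100.0 / baseline


# Hypothetical success rates in percent (illustrative only, not from the paper).
baseline_success = 50.0
improved_success = 54.0

relative_gain = relative_improvement(baseline_success, improved_success)
absolute_gain = improved_success - baseline_success

print(f"relative gain: {relative_gain:.1f}%")        # relative to the baseline
print(f"absolute gain: {absolute_gain:.1f} points")  # raw percentage points
```

Under these placeholder values, a 4-point absolute gain corresponds to an 8% relative gain, which is why the two kinds of figures should not be compared directly.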
Problem

Research questions and friction points this paper is trying to address.

Overcoming compounding errors in sequential action decisions for behavioral cloning
Addressing physical discontinuities and semantic-physical misalignment in robot control
Improving action cloning accuracy and execution consistency in human-robot interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous co-learning across vision, language, and proprioceptive inputs
Bidirectional cross-attention for semantic-physical alignment
Generating robust and smooth action execution trajectories
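
The bidirectional cross-attention idea listed above can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the paper's architecture: it uses plain scaled dot-product attention with no learned query, key, or value projections, and every name here (`cross_attend`, `bidirectional_cross_attention`, the toy token values) is hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Each query token becomes a softmax-weighted mix of the context tokens."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)]
        )
    return attended


def bidirectional_cross_attention(
    language: Matrix, visuomotor: Matrix
) -> Tuple[Matrix, Matrix]:
    """Run attention in both directions between the two token sets."""
    # Language tokens query the visuomotor tokens (semantics -> physical).
    language_grounded = cross_attend(language, visuomotor)
    # Visuomotor tokens query the language tokens (physical -> semantics).
    visuomotor_grounded = cross_attend(visuomotor, language)
    return language_grounded, visuomotor_grounded


# Toy 2-dimensional embeddings (hypothetical values for illustration).
language_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visuomotor_tokens = [[0.5, 0.5], [2.0, 0.0]]

lang_out, phys_out = bidirectional_cross_attention(
    language_tokens, visuomotor_tokens
)
print(len(lang_out), len(phys_out))
```

Each output keeps the length of its own query sequence, so the language side and the visuomotor side each receive context from the other without changing shape. A trained model would normally add learned projections, multiple heads, and residual connections on top of this.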
Xiuxiu Qi
The College of Artificial Intelligence & Shenzhen Research Institute, Nankai University, Tianjin, China.
Yu Yang
Centre for Learning, Teaching and Technology, The Education University of Hong Kong, Hong Kong SAR, China.
Jiannong Cao
IEEE Fellow; Chair Professor, Hong Kong Polytechnic University
Distributed computing, Mobile and pervasive computing, Wireless sensor networks, Cloud computing, Big Data
Luyao Bai
Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China.
Chongshan Fan
The College of Artificial Intelligence & Shenzhen Research Institute, Nankai University, Tianjin, China.
Chengtai Cao
Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China.
Hongpeng Wang
Robotic Institute, Nankai University
Intelligent Robotics, Artificial Intelligence