🤖 AI Summary
To address the challenge of decision alignment in human-robot collaboration within unstructured environments, this paper proposes a multimodal imitation learning framework that, for the first time, jointly models human RGB video demonstrations and robot-centric 3D voxelized RGB-D demonstrations. The method is a dual-stream alignment architecture that jointly encodes human intent and predicts robot actions, combining a ResNet-based visual encoder with a Perceiver Transformer module for voxel processing to achieve cross-modal behavioral-semantic matching. Evaluated on the "pick and place" task from the RH20T dataset (5 users, 10 scenes), the approach achieves 71.67% accuracy in human intent recognition and 71.8% in robot action prediction, demonstrating effective cross-modal intent-action alignment. Key contributions include: (i) the first unified framework for jointly modeling 2D human visual demonstrations and 3D robot voxel demonstrations; (ii) a dual-stream alignment mechanism enabling interpretable cross-modal semantic mapping; and (iii) a new paradigm for imitation learning and human-robot co-adaptation in unstructured settings.
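The summary describes the dual-stream architecture only at a high level. The sketch below shows one way a ResNet-plus-Perceiver pairing like this could be wired up in PyTorch; all layer sizes, class counts, and the module names (`HumanIntentEncoder`, `RobotVoxelPerceiver`, `num_intents`, `num_actions`) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal PyTorch sketch of the dual-stream idea: a ResNet stream over human RGB
# frames and a Perceiver-style stream over robot voxel tokens. Dimensions and
# heads are assumed for illustration only.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class HumanIntentEncoder(nn.Module):
    """ResNet-based encoder over human RGB frames -> intent logits (assumed head)."""
    def __init__(self, num_intents: int = 10):
        super().__init__()
        backbone = resnet18(weights=None)   # pretrained weights optional
        backbone.fc = nn.Identity()         # keep the 512-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(512, num_intents)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> average per-frame features over time
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.view(b * t, c, h, w)).view(b, t, -1)
        return self.head(feats.mean(dim=1))


class RobotVoxelPerceiver(nn.Module):
    """Perceiver-style module: latent queries cross-attend to voxel tokens -> action logits."""
    def __init__(self, voxel_dim: int = 10, num_actions: int = 10,
                 latent_dim: int = 256, num_latents: int = 64):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        self.voxel_proj = nn.Linear(voxel_dim, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(latent_dim, nhead=4, batch_first=True)
        self.head = nn.Linear(latent_dim, num_actions)

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        # voxels: (B, N_voxels, voxel_dim), e.g. flattened occupancy + color features
        tokens = self.voxel_proj(voxels)
        queries = self.latents.unsqueeze(0).expand(voxels.size(0), -1, -1)
        latents, _ = self.cross_attn(queries, tokens, tokens)
        latents = self.self_attn(latents)
        return self.head(latents.mean(dim=1))


if __name__ == "__main__":
    human_stream = HumanIntentEncoder(num_intents=10)
    robot_stream = RobotVoxelPerceiver(voxel_dim=10, num_actions=10)
    intent_logits = human_stream(torch.randn(2, 8, 3, 224, 224))  # 2 clips, 8 frames each
    action_logits = robot_stream(torch.randn(2, 4096, 10))        # 2 scenes, 4096 voxel tokens
    print(intent_logits.shape, action_logits.shape)               # torch.Size([2, 10]) twice
```

The two heads can then be compared (or trained jointly) to check whether the predicted human intent and robot action agree, which is the cross-modal alignment the paper evaluates.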
📝 Abstract
Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video alongside robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we use data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling with a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
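The "voxelized RGB-D space" mentioned above implies converting depth-camera point clouds into a fixed voxel grid before they reach the robot stream. Below is a minimal, hypothetical voxelization sketch; the workspace bounds, the 64³ resolution, and the occupancy-plus-mean-color channels are assumptions for illustration, since the paper's exact grid parameters are not given here.

```python
# Hypothetical RGB-D voxelization: map an (N, 3) point cloud with (N, 3) colors
# into an occupancy + mean-RGB grid. Bounds and resolution are illustrative.
import numpy as np


def voxelize_rgbd(points: np.ndarray, colors: np.ndarray,
                  bounds=((-0.5, 0.5), (-0.5, 0.5), (0.0, 1.0)),
                  resolution: int = 64) -> np.ndarray:
    """points: (N, 3) xyz in metres; colors: (N, 3) RGB in [0, 1].
    Returns a (resolution, resolution, resolution, 4) grid: occupancy + mean RGB."""
    grid = np.zeros((resolution,) * 3 + (4,), dtype=np.float32)
    counts = np.zeros((resolution,) * 3, dtype=np.int32)
    lows = np.array([b[0] for b in bounds])
    highs = np.array([b[1] for b in bounds])

    # Keep only points inside the workspace, then map them to integer voxel indices.
    mask = np.all((points >= lows) & (points < highs), axis=1)
    idx = ((points[mask] - lows) / (highs - lows) * resolution).astype(int)

    for (i, j, k), rgb in zip(idx, colors[mask]):
        grid[i, j, k, 0] = 1.0       # occupancy flag
        grid[i, j, k, 1:] += rgb     # accumulate color
        counts[i, j, k] += 1

    occupied = counts > 0
    grid[occupied, 1:] /= counts[occupied, None]   # mean color per occupied voxel
    return grid
```

A grid produced this way can be flattened into per-voxel tokens and fed to a Perceiver-style module of the kind described in the summary.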