Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

📅 2025-04-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of decision alignment in human-robot collaboration within unstructured environments, this paper proposes a multimodal imitation learning framework that, for the first time, jointly models human RGB video demonstrations and robot-centric 3D voxelized RGB-D demonstrations. Methodologically, we design a dual-stream alignment architecture that jointly encodes human intent and predicts robot actions, integrating a ResNet-based visual encoder with a Perceiver Transformer module for voxel processing to achieve cross-modal behavioral semantic matching. Evaluated on the RH20T dataset (5 users, 10 scenes), our approach achieves 71.67% accuracy in human intent recognition and 71.8% in robot action prediction, demonstrating significant efficacy in cross-modal intent–action alignment. Key contributions include: (i) the first unified framework for jointly modeling 2D human visual and 3D robot voxel demonstrations; (ii) a dual-stream alignment mechanism enabling interpretable cross-modal semantic mapping; and (iii) a novel paradigm for imitation learning and human–robot co-adaptation in unstructured settings.

Technology Category

Application Category

📝 Abstract
Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the"pick and place"task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
Problem

Research questions and friction points this paper is trying to address.

Aligning human and robot actions via multimodal learning
Modeling human-robot action correspondence in unstructured environments
Improving imitation learning accuracy in pick-and-place tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal learning aligns human and robot actions
ResNet and Perceiver Transformer model intentions
Voxelized RGB-D space for robot demonstrations
🔎 Similar Papers
No similar papers found.
A
Azizul Zahid
Department of Electrical Engineering and Computer Science, University of Tennessee Knoxville, Knoxville, TN 37996, USA
Jie Fan
Jie Fan
Zhejiang University
catalysis and mesoporous materials
F
Farong Wang
Department of Electrical Engineering and Computer Science, University of Tennessee Knoxville, Knoxville, TN 37996, USA
A
Ashton Dy
Department of Electrical Engineering and Computer Science, University of Tennessee Knoxville, Knoxville, TN 37996, USA
Sai Swaminathan
Sai Swaminathan
University of Tennessee
Human-Computer InteractionUbiquitous Computingand many things....
F
Fei Liu
Department of Electrical Engineering and Computer Science, University of Tennessee Knoxville, Knoxville, TN 37996, USA