Robust Understanding of Human-Robot Social Interactions through Multimodal Distillation

📅 2025-05-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of real-time, robust human-robot interaction (HRI) scene understanding under incomplete information and noisy sensory inputs, this paper proposes the first multimodal-to-unimodal robust knowledge distillation framework for social HRI understanding. The framework takes only lightweight body pose as input at inference time and leverages a multimodal teacher model—integrating pose, facial expression, gesture, gaze, and visual scene features—to guide a unimodal student model via adversarial robust training and information bottleneck optimization. Evaluated on two benchmark HRI datasets, the method achieves an average accuracy improvement of 14.75% over baseline approaches and maintains stable performance even with up to 51% of its input corrupted. Moreover, the student model has less than 1% of the teacher's parameters and incurs only about 0.05% of its computational cost, significantly improving deployment feasibility and cross-scenario generalization.
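The summary above describes a multimodal teacher guiding a pose-only student. A minimal sketch of the core distillation objective is below, using the classic soft-target formulation (temperature-scaled KL term plus hard-label cross-entropy); the exact loss, temperature, and weighting in the paper are not specified here, so treat all names and hyperparameters as illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Blend hard-label cross-entropy with a soft KL term that pulls the
    pose-only student's predictions toward the multimodal teacher's.
    T and alpha are illustrative, not the paper's values."""
    p_t = softmax(teacher_logits, T)  # soft teacher targets
    p_s = softmax(student_logits, T)  # soft student predictions
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    # T**2 rescales the soft-target gradient, as is standard in distillation
    return np.mean((1 - alpha) * ce + alpha * (T ** 2) * kl)
```

In a full pipeline, the teacher's logits would come from the fused multimodal features and the student's from body pose alone; only the student is deployed.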

📝 Abstract
The need for social robots and agents to interact with and assist humans is growing steadily. To interact successfully with humans, they need to understand and analyse socially interactive scenes from their (the robot's) perspective. Works that model social situations between humans and agents are few, and even existing ones are often too computationally intensive for real-time deployment or for real-world scenarios with limited available information. We propose a robust knowledge distillation framework that models social interactions through various multimodal cues, yet is robust against incomplete and noisy information during inference. Our teacher model is trained with multimodal input (body, face and hand gestures, gaze, raw images) and transfers knowledge to a student model that relies solely on body pose. Extensive experiments on two publicly available human-robot interaction datasets demonstrate that our student model achieves an average accuracy gain of 14.75% over relevant baselines on multiple downstream social understanding tasks, even with up to 51% of its input corrupted. The student model is highly efficient: it is $<1\%$ of the teacher model's size in parameters and uses $\sim 0.5$‰ of the teacher model's FLOPs. Our code will be made public upon publication.
Problem

Research questions and friction points this paper is trying to address.

Modeling human-robot social interactions robustly with limited data
Reducing computational intensity for real-time social scene analysis
Distilling multimodal social cues into efficient body pose models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal knowledge distillation for social interaction modeling
Robust against incomplete and noisy input data
Efficient student model using only body pose
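The robustness claim (stable accuracy with up to 51% of the input corrupted) implies training and evaluating the student under degraded pose input. A minimal sketch of one plausible corruption protocol, random keypoint dropout plus Gaussian jitter, is below; the paper's actual corruption model is not detailed here, so the rates and shapes are assumptions for illustration.

```python
import numpy as np

def corrupt_pose(pose, drop_rate=0.51, noise_std=0.05, rng=None):
    """Simulate incomplete/noisy sensing on a (K, 2) array of 2D keypoints:
    zero out a random subset of keypoints and jitter the survivors.
    drop_rate=0.51 mirrors the 51% corruption level quoted above;
    the dropout-plus-noise scheme itself is an illustrative assumption."""
    rng = rng or np.random.default_rng()
    keep = rng.random(pose.shape[0]) >= drop_rate        # per-keypoint mask
    noisy = pose + rng.normal(0.0, noise_std, pose.shape)
    return np.where(keep[:, None], noisy, 0.0), keep
```

During robust training, such corrupted poses would be fed to the student while the teacher still sees clean multimodal input, so the distillation signal teaches the student to compensate for missing cues.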