UGotMe: An Embodied System for Affective Human-Robot Interaction

📅 2024-10-24
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Real-world multiparty dialogues pose two key challenges for visual emotion recognition: environmental noise (e.g., non-speaking participants and cluttered backgrounds) and high end-to-end response latency. To address these, this work proposes UGotMe, an embodied intelligent system that introduces an active face extraction mechanism to suppress visual interference. The system jointly optimizes edge-based face detection and tracking, cloud-edge collaborative transmission, and lightweight streaming multimodal inference to achieve low-latency emotion recognition. Deployed end-to-end on the Ameca humanoid robot, UGotMe demonstrates robust emotion understanding in dynamic multiparty scenarios with an end-to-end latency under 300 ms, significantly improving interaction naturalness and practical deployability in real-time human-robot dialogue systems.
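The summary does not spell out how the active face extraction mechanism decides which face belongs to the current speaker. Below is a minimal illustrative sketch, not the paper's actual method, assuming a simple mouth-motion heuristic: detect all faces in the frame and keep the one whose mouth region changes most between consecutive frames. All function and variable names here are hypothetical.

```python
# Illustrative sketch of an "active face extraction" step (assumed heuristic,
# not the paper's exact algorithm): among all detected faces, keep the one
# whose lower (mouth) region changes most across consecutive frames,
# as a crude proxy for the currently active speaker.
import cv2
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def extract_active_face(prev_gray, curr_gray):
    """Return the face crop of the most likely active speaker, or None."""
    faces = face_detector.detectMultiScale(curr_gray, scaleFactor=1.1, minNeighbors=5)
    best_crop, best_motion = None, 0.0
    for (x, y, w, h) in faces:
        # Compare the lower half of the face (mouth area) between frames.
        mouth_prev = prev_gray[y + h // 2 : y + h, x : x + w]
        mouth_curr = curr_gray[y + h // 2 : y + h, x : x + w]
        motion = float(np.mean(cv2.absdiff(mouth_prev, mouth_curr)))
        if motion > best_motion:
            best_motion = motion
            best_crop = curr_gray[y : y + h, x : x + w]
    return best_crop  # cropped face of the presumed active speaker
```

In the paper's terms, this step serves both denoising goals at once: cropping to faces removes distracting background objects, and selecting a single active face rules out inactive speakers.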

πŸ“ Abstract
Equipping humanoid robots with the capability to understand the emotional states of human interactants and to express emotions appropriately according to the situation is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noise inherent in the robot's visual observations, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the robot's field of view, hinders the models from extracting emotional cues from visual inputs. Second, real-time response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve real-time response capability. We deploy UGotMe on a humanoid robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://pi3-141592653.github.io/UGotMe/.
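The abstract attributes the latency reduction to efficient data transmission from the robot to the local server, without specifying the transport. A minimal sketch of one plausible design, assuming the edge side sends only JPEG-compressed active-speaker face crops over a persistent socket rather than streaming full camera frames, is shown below; the server address, port, and function names are assumptions for illustration.

```python
# Minimal sketch of robot-to-server transmission (assumed design, not the
# paper's exact protocol): send only the JPEG-compressed face crop of the
# active speaker, with a 4-byte length prefix, over a persistent connection.
import socket
import struct
import cv2

SERVER_ADDR = ("192.168.1.50", 9000)  # hypothetical local inference server

def send_face_crop(sock: socket.socket, face_bgr) -> None:
    """JPEG-encode a face crop and send it with a length prefix."""
    ok, buf = cv2.imencode(".jpg", face_bgr, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if not ok:
        return
    payload = buf.tobytes()
    sock.sendall(struct.pack("!I", len(payload)) + payload)

# Usage (assumed): keep one connection open and reuse it per frame
# to avoid per-frame reconnection overhead.
# sock = socket.create_connection(SERVER_ADDR)
# send_face_crop(sock, face_crop)
```

Transmitting a small cropped image instead of a full frame is one straightforward way to keep per-frame payloads small enough to fit a sub-300 ms end-to-end budget on a local network.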
Problem

Research questions and friction points this paper is trying to address.

Enhance emotion recognition in noisy multiparty conversations.
Achieve real-time response in human-robot interactions.
Filter distractions and inactive speakers for accurate emotion analysis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized active face extraction strategy
Efficient data transmission for real-time response
Denoising strategies for multiparty conversation scenarios
🔎 Similar Papers
No similar papers found.