🤖 AI Summary
This work proposes a novel aerial embodied intelligent agent that addresses interaction uncertainty in human-inhabited spaces, where drones lack effective mechanisms for communicating their intentions. The system uniquely integrates flight capability, infrastructure-free MEMS laser projection paired with an onboard semi-rigid screen, and an adaptive multimodal dialogue AI. Perceiving users through an RGB camera and captured speech, it combines voice activity detection (VAD), Whisper-based transcription, LLM-driven intent classification, a RAG-enhanced dialogue system, facial analysis, and XTTS v2 for lip-synced, personalized avatar responses. Evaluated in naturalistic interactions, the system achieves high accuracy in command recognition (F1: 0.90), demographic attribute estimation (gender F1: 0.89; age MAE: 5.14 years), and speech transcription (WER: 0.181), demonstrating robust spatial awareness and socially responsive capabilities.
📝 Abstract
Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI introduces a new class of spatially aware, socially responsive embodied agents for applications in guidance, assistance, and human-centered interaction.