🤖 AI Summary
This work proposes a novel aerial embodied intelligent agent that addresses interaction uncertainty in human-inhabited spaces, where drones lack effective mechanisms for communicating their intentions. The system uniquely integrates flight capability, infrastructure-free MEMS laser projection paired with an onboard semi-rigid screen, and an adaptive multimodal dialogue AI. Perceiving users through an RGB camera and captured speech, it combines voice activity detection (VAD), Whisper-based transcription, LLM-driven intent classification, a RAG-enhanced dialogue system, facial analysis, and XTTS v2 for lip-synced, personalized avatar responses. Evaluated in naturalistic interactions, the system achieves high accuracy in command recognition (F1: 0.90), demographic attribute estimation (gender F1: 0.89; age MAE: 5.14 years), and speech transcription (WER: 0.181), demonstrating robust spatial awareness and socially responsive capabilities.
📝 Abstract
Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By uniting aerial robotics with adaptive conversational AI and self-contained visual output, HoverAI introduces a new class of spatially aware, socially responsive embodied agents for applications in guidance, assistance, and human-centered interaction.