AI Summary
This work addresses the dual uncertainty arising from missing task knowledge and ambiguous human intent in open-world human-robot collaboration, a challenge inadequately handled by existing approaches, which often treat humans as passive supervisors. To enable proactive coordination, we propose the first bimodal joint planning framework that integrates active uncertainty reduction with implicit intention inference. On one hand, cognitive uncertainty is mitigated through large language model (LLM)-guided active querying and hypothesis-augmented A* search. On the other, human intent is inferred in real time by fusing vision-language model (VLM)-driven 3D semantic perception with spatial directional cues, enabling dynamic collaboration without explicit communication. Experiments demonstrate that our method reduces interaction cost by 51.9% in Gazebo simulation and shortens task execution time by 25.4% on a real-world drone platform, significantly enhancing both collaborative efficiency and naturalness.
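To make the active-querying idea concrete, the sketch below contrasts the expected cost of directly verifying candidate hypotheses against the cost of first querying the human. It is an illustrative simplification, not the paper's formulation: the paper computes an optimal querying policy via dynamic programming, while this sketch reduces the decision to a single expected-cost comparison, and the cost model (`c_query`, `c_verify`) is an assumption.

```python
def optimal_query_policy(priors, c_query, c_verify):
    """Decide whether to query the human or verify hypotheses directly.

    priors   -- probabilities of the candidate hypotheses (sum to 1)
    c_query  -- cost of one clarifying question (resolves the ambiguity)
    c_verify -- cost of physically attempting/verifying one hypothesis

    Verifying in decreasing-probability order: each attempt is paid only
    if all earlier attempts failed. Querying first costs c_query plus a
    single verification of the confirmed hypothesis.
    """
    probs = sorted(priors, reverse=True)
    unresolved = 1.0          # P(task still ambiguous before attempt i)
    expected_verify = 0.0
    for p in probs:
        expected_verify += c_verify * unresolved
        unresolved -= p
    expected_query = c_query + c_verify
    if expected_query < expected_verify:
        return "query", expected_query
    return "verify", expected_verify
```

With a flat prior, querying quickly becomes worthwhile; with one dominant hypothesis, direct verification wins, which is the intuition behind minimizing combined interaction and verification costs.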
Abstract
Effective human-robot collaboration in open-world environments requires joint planning under uncertainty. However, existing approaches often treat humans as passive supervisors, preventing autonomous agents from becoming human-like teammates that actively model teammate behavior, reason about knowledge gaps, and resolve uncertainties by querying and eliciting responses through communication. To address these limitations, we propose a unified human-robot joint planning system that tackles two sources of uncertainty: task-relevant knowledge gaps and latent human intent. Our system operates in two complementary modes. First, an uncertainty-mitigation joint planning module conducts two-way conversations to resolve semantic ambiguity and object uncertainty. It combines an LLM-assisted active elicitation mechanism with a hypothesis-augmented A* search, then computes an optimal querying policy via dynamic programming to minimize interaction and verification costs. Second, a real-time intent-aware collaboration module maintains a probabilistic belief over the human's latent task intent from spatial and directional cues, enabling dynamic, coordination-aware task selection without explicit communication. We validate the proposed system in both Gazebo simulations and real-world UAV deployments integrated with a Vision-Language Model (VLM)-based 3D semantic perception pipeline. Experimental results show that, compared to the baselines, the system cuts interaction cost by 51.9% in uncertainty-mitigation planning and reduces task execution time by 25.4% in intent-aware cooperation.
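The intent-aware mode can be illustrated with a minimal Bayesian belief update over candidate tasks, driven by the two cue types the abstract names: spatial (distance to each task) and directional (whether the human is heading toward it). Everything here is a hedged sketch under assumed models; the likelihood forms, the concentration parameter `kappa`, and the distance scale are illustrative choices, not the paper's.

```python
import math

def update_intent_belief(belief, human_pos, human_heading, task_locations,
                         kappa=2.0, dist_scale=5.0):
    """One Bayesian update of the belief over the human's latent task intent.

    belief         -- dict task -> prior probability
    human_pos      -- (x, y) position of the human
    human_heading  -- heading angle in radians
    task_locations -- dict task -> (x, y)

    Directional cue: a von Mises-style likelihood peaked when the human
    faces the task. Spatial cue: an exponential falloff with distance.
    Both are illustrative stand-ins for the paper's cue models.
    """
    posterior = {}
    for task, prior in belief.items():
        tx, ty = task_locations[task]
        dx, dy = tx - human_pos[0], ty - human_pos[1]
        bearing = math.atan2(dy, dx)
        directional = math.exp(kappa * math.cos(bearing - human_heading))
        spatial = math.exp(-math.hypot(dx, dy) / dist_scale)
        posterior[task] = prior * directional * spatial
    z = sum(posterior.values())
    return {t: p / z for t, p in posterior.items()}
```

Given such a belief, a coordination-aware agent would select a task the human is unlikely to be pursuing (e.g., the argmin of the posterior among remaining tasks), avoiding duplicated effort without any explicit communication.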