Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of semantic navigation and collaborative manipulation for heterogeneous multi-robot systems in dynamic environments, this paper proposes a hierarchical language model framework that achieves, for the first time, deep integration of large language model (LLM)-driven task reasoning, vision-language model (VLM)-based perception (enhanced with GridMask), and motion planning on a real aerial-ground heterogeneous robotic system. The LLM performs high-level task decomposition and constructs a global semantic map; the VLM extracts spatially constrained semantic labels from aerial imagery to enable local navigation for ground robots and implicit semantic alignment under target occlusion or absence. Cross-modal mapping and global-local cooperative planning ensure semantic continuity and environmental adaptability. Evaluated on a real-world alphabet-block arrangement task, the system successfully accomplishes end-to-end semantic navigation, dynamic replanning, and precise grasping, demonstrating robustness and generalization across varying conditions.

📝 Abstract
Heterogeneous multi-robot systems show great potential in complex tasks requiring coordinated hybrid cooperation. However, traditional approaches relying on static models often struggle with task diversity and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a GridMask-enhanced fine-tuned Vision Language Model (VLM). The LLM performs task decomposition and global semantic map construction, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. Within this framework, the aerial robot follows a globally optimized semantic path and continuously provides bird-view images, guiding the ground robot's local semantic navigation and manipulation, including target-absent scenarios where implicit alignment is maintained. Experiments on a real-world letter-cubes arrangement task demonstrate the framework's adaptability and robustness in dynamic environments. To the best of our knowledge, this is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
Problem

Research questions and friction points this paper is trying to address.

Bridging high-level reasoning with low-level execution in heterogeneous robots
Handling task diversity in dynamic multi-robot environments
Integrating semantic perception with task planning for aerial-ground systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical framework with LLM and VLM
LLM for task decomposition and global mapping
VLM for semantic labels and spatial info
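The hierarchy above can be sketched in code. This is an illustrative mock-up, not the paper's actual implementation: the class names, the letter-cube subtask triples, and the `decompose_task` helper are all assumptions standing in for the LLM's task decomposition and the VLM's label extraction.

```python
# Hypothetical sketch of the hierarchical flow: a high-level planner (the LLM's
# role) decomposes a task over a global semantic map, while a perception layer
# (the VLM's role) supplies task-specified 2D semantic labels for local planning.
# All names and structures are illustrative, not the paper's interfaces.
from dataclasses import dataclass, field

@dataclass
class SemanticLabel:
    """A task-specified label with 2D position, as a VLM might extract from aerial images."""
    name: str
    xy: tuple[float, float]

@dataclass
class GlobalSemanticMap:
    """Global map accumulated from semantic labels reported by the aerial robot."""
    labels: dict[str, SemanticLabel] = field(default_factory=dict)

    def update(self, label: SemanticLabel) -> None:
        self.labels[label.name] = label

def decompose_task(goal_word: str) -> list:
    """Stand-in for LLM task decomposition on the letter-cubes arrangement task:
    emits one (navigate, grasp, place) subtask triple per letter of the goal word."""
    subtasks = []
    for i, letter in enumerate(goal_word):
        subtasks += [("navigate", letter), ("grasp", letter), ("place", (i, 0.0))]
    return subtasks

# Usage: build a map from hypothetical aerial detections, then decompose the task.
world = GlobalSemanticMap()
for name, xy in {"A": (0.5, 1.0), "B": (2.0, 0.3)}.items():
    world.update(SemanticLabel(name, xy))

plan = decompose_task("AB")
```

In the real system the subtask list would drive the ground robot's local planner, with the map continuously refreshed from bird-view images; this sketch only captures the global-to-local data flow.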
Haokun Liu
Vector Institute, University of Toronto
Natural Language Processing
Zhaoqi Ma
DRAGON Lab at Department of Mechanical Engineering, The University of Tokyo, Tokyo, 113-8654, Japan
Yunong Li
DRAGON Lab, The University of Tokyo
Aerial robots, Quadruped robots
Junichiro Sugihara
Ph.D. student, DRAGON Lab, The University of Tokyo
Aerial robotics, Modular robotics
Yicheng Chen
DRAGON Lab at Department of Mechanical Engineering, The University of Tokyo, Tokyo, 113-8654, Japan
Jinjie Li
DRAGON Lab at Department of Mechanical Engineering, The University of Tokyo, Tokyo, 113-8654, Japan
Moju Zhao
DRAGON Lab, The University of Tokyo
Robotics, Aerial Robotics, Motion Control, Motion Planning, Computer Vision