HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

📅 2026-04-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a weakness of end-to-end vision-language-action (VLA) models: fine-tuning often erodes the high-level reasoning capabilities of the underlying vision-language model (VLM), so the resulting policy struggles to combine semantic planning with precise manipulation. To overcome this, the authors propose a visually anchored hierarchical framework that decouples high-level semantic planning from low-level action control. In this architecture, a high-level VLM generates subtask instructions annotated with target bounding boxes, while a low-level diffusion Transformer (DiT), equipped with a novel cascaded cross-attention mechanism, executes accurate actions. The design preserves the VLM’s zero-shot reasoning ability, enables independent optimization of the planning and execution modules, and improves generalization by integrating global context, high-resolution object crops, and skill semantics. Experiments show that the method significantly outperforms existing end-to-end approaches in both simulation and real-world settings, with the largest gains in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
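To make the planner-to-controller interface concrete, here is a minimal sketch of what a structured plan and its object-centric crop region might look like. The class name `SubtaskPlan`, its fields, the `crop_region` helper, and the relative padding parameter are all hypothetical; the source only states that each plan pairs a subtask instruction with a target bounding box.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one planner step (field names are assumptions)."""
    instruction: str                  # e.g. "pick up the red block"
    bbox: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def crop_region(plan: SubtaskPlan, image_w: int, image_h: int,
                pad: float = 0.1) -> Tuple[int, int, int, int]:
    """Expand the target box by a relative margin and clamp it to the image.

    The padding is an illustrative choice, not something the source specifies.
    """
    x0, y0, x1, y1 = plan.bbox
    if x1 <= x0 or y1 <= y0:
        raise ValueError("bounding box must have positive area")
    dx = int(round((x1 - x0) * pad))
    dy = int(round((y1 - y0) * pad))
    return (max(0, x0 - dx), max(0, y0 - dy),
            min(image_w, x1 + dx), min(image_h, y1 + dy))


# A small object near the right edge of a 640x480 frame.
plan = SubtaskPlan("pick up the red block", (600, 20, 640, 60))
region = crop_region(plan, image_w=640, image_h=480, pad=0.25)
print(region)  # (590, 10, 640, 70)
```

The region returned here would be cut from the full-resolution frame, which is one plausible way a small object could reach the low-level policy at higher effective resolution than a downsampled global view provides.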

📝 Abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, the low-level part introduces a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
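The sequential fusion described in the abstract can be illustrated with a toy cascade of cross-attention stages. This is a sketch under stated assumptions, not the paper's implementation: it uses a single head, no learned query/key/value projections, no normalization layers, and a plain residual connection, and the stage order simply follows the order in which the abstract lists the three conditioning sources. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product cross-attention with a residual add.

    Keys and values are the raw context tokens (no learned projections),
    which is a simplification made for illustration.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        attended = [sum(w * k[i] for w, k in zip(weights, context))
                    for i in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out


def cascaded_cross_attention(action_tokens: List[Vector],
                             global_ctx: List[Vector],
                             crop_ctx: List[Vector],
                             skill_ctx: List[Vector]) -> List[Vector]:
    """Fuse the three conditioning sources one after another (assumed order)."""
    x = action_tokens
    for ctx in (global_ctx, crop_ctx, skill_ctx):
        x = cross_attend(x, ctx)
    return x


# With one token per source, each stage simply adds that token to the query.
fused = cascaded_cross_attention(
    action_tokens=[[0.0, 0.0]],
    global_ctx=[[1.0, 0.0]],
    crop_ctx=[[0.0, 2.0]],
    skill_ctx=[[3.0, 3.0]],
)
print(fused)  # [[4.0, 5.0]]
```

Because each stage queries with the output of the previous one, later sources (here the crop and skill tokens) are attended to by action tokens that already carry scene-level information. That is the property a cascade offers over concatenating all conditioning tokens into a single attention call.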
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
embodied manipulation
semantic planning
motor control
visual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Embodied Manipulation
Visual Grounding
Diffusion Transformer
Decoupled Architecture
Flow-Matching