HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge that end-to-end vision-language-action models often lose the high-level reasoning capabilities of vision-language models (VLMs) during fine-tuning, struggling to jointly achieve semantic planning and precise manipulation. To overcome this, the authors propose a visually anchored hierarchical framework that decouples high-level semantic planning from low-level action control. In this architecture, a high-level VLM generates subtask instructions annotated with target bounding boxes, while a low-level diffusion Transformer (DiT), equipped with a novel cascaded cross-attention mechanism, executes accurate actions. The design preserves the VLM’s zero-shot reasoning ability, enables independent optimization of planning and execution modules, and enhances generalization by integrating global context, high-resolution object crops, and skill semantics. Experiments demonstrate that the method significantly outperforms existing end-to-end approaches in both simulation and real-world settings, particularly excelling in long-horizon skill composition and fine-grained manipulation of small objects in cluttered scenes.
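
The interface between the two levels, a subtask instruction paired with a target bounding box, can be illustrated with a minimal sketch. The names `SubtaskPlan` and `crop_target`, the `(x_min, y_min, x_max, y_max)` pixel convention, and the nested-list image are all assumptions made for illustration; the summary does not specify the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one high-level planner output."""
    instruction: str
    # Assumed convention: (x_min, y_min, x_max, y_max) in pixels, max exclusive.
    bbox: Tuple[int, int, int, int]


def crop_target(image: List[List[int]], plan: SubtaskPlan) -> List[List[int]]:
    """Cut the object-centric crop named by the plan out of a row-major image."""
    x_min, y_min, x_max, y_max = plan.bbox
    height, width = len(image), len(image[0])
    # Clamp the box to the image so a slightly oversized box still yields a crop.
    x_min, x_max = max(0, x_min), min(width, x_max)
    y_min, y_max = max(0, y_min), min(height, y_max)
    if x_min >= x_max or y_min >= y_max:
        raise ValueError("bounding box does not overlap the image")
    return [row[x_min:x_max] for row in image[y_min:y_max]]


# Toy 4x6 "image" whose pixel value encodes its position as row * 10 + column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
plan = SubtaskPlan(instruction="pick up the small red block", bbox=(1, 1, 3, 3))
crop = crop_target(image, plan)
```

In this sketch the low-level policy would receive both the full image and `crop`, so the global context and the object-centric view stay available side by side.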

📝 Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
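
The sequential fusion described in the abstract can be sketched in a few lines of plain Python. This is an illustration under strong simplifying assumptions, not the authors' implementation: it uses a single attention head, identity projections in place of learned query/key/value weights, and no normalization or timestep conditioning. The function names `cross_attend` and `cascaded_cross_attention` are hypothetical, and the stage order simply follows the order in which the abstract lists the three inputs.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]  # each row is one token embedding


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Single-head cross-attention with identity projections and a residual add."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every context token.
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in context])
        # Weighted sum of context tokens (keys double as values in this sketch).
        attended = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


def cascaded_cross_attention(
    action_tokens: Matrix,
    global_tokens: Matrix,
    crop_tokens: Matrix,
    skill_tokens: Matrix,
) -> Matrix:
    """Fuse the three conditioning streams one after another, not all at once."""
    x = cross_attend(action_tokens, global_tokens)  # stage 1: global scene context
    x = cross_attend(x, crop_tokens)                # stage 2: object-centric crop
    x = cross_attend(x, skill_tokens)               # stage 3: skill semantics
    return x


# Toy inputs: one action token and one context token per stream.
fused = cascaded_cross_attention(
    action_tokens=[[0.0, 0.0]],
    global_tokens=[[1.0, 0.0]],
    crop_tokens=[[0.0, 1.0]],
    skill_tokens=[[1.0, 1.0]],
)
```

Because each stage attends with the output of the previous one, later streams are queried by action tokens that already carry earlier context, which is the property that distinguishes a cascade from concatenating all conditioning tokens into one attention call.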

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

embodied manipulation

semantic planning

motor control

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Embodied Manipulation

Visual Grounding

Diffusion Transformer