PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention

📅 2025-12-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models suffer from high action redundancy and temporal instability in goal-directed action generation, hindering their deployment in real-time embodied tasks. To address this, we propose Pose-Conditioned Anchor Attention, a lightweight mechanism that needs no auxiliary perception module and enables spatially selective visual focus by directly incorporating joint and end-effector pose supervision into the vision-language model. This facilitates end-to-end generation of compact, robust action sequences tightly aligned with visual-language instructions. Evaluated on multiple robotic manipulation benchmarks, our method significantly reduces action redundancy, improves execution accuracy and speed, and demonstrates strong cross-environment generalization. Moreover, it achieves real-time inference with an average latency of under 80 ms, enabling practical deployment in time-critical embodied AI systems.

📝 Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable performance on embodied tasks and shown promising potential for real-world applications. However, current VLAs still struggle to produce consistent and precise target-oriented actions: they often generate redundant or unstable motions along trajectories, which limits their applicability in time-sensitive scenarios. In this work, we attribute these redundant actions to the spatially uniform perception field of existing VLAs, which causes them to be distracted by target-irrelevant objects, especially in complex environments. To address this issue, we propose PosA-VLA, an efficient framework that anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. The pose-conditioned anchor attention mechanism enables the model to better align instruction semantics with actionable visual cues, thereby improving action generation precision and efficiency. Moreover, our framework adopts a lightweight architecture and requires no auxiliary perception modules (e.g., segmentation or grounding networks), ensuring efficient inference. Extensive experiments verify that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks and shows robust generalization in a variety of challenging environments.

Problem

Research questions and friction points this paper is trying to address.

Improves action precision in Vision-Language-Action models

Reduces redundant motions in robotic manipulation tasks

Enhances attention alignment with task-relevant visual cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pose-conditioned anchor attention for visual guidance

Lightweight architecture without auxiliary perception modules

Improved action precision and efficiency in complex environments
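
To make the idea of pose-conditioned attention supervision more concrete, here is a minimal NumPy sketch. It is not the paper's formulation, which this page does not give. It assumes, purely for illustration, that the end-effector pose has already been projected to a position on the image-patch grid, that the supervision target is a Gaussian bump centered there, and that the attention map is pulled toward that target with a KL-divergence loss. The names `gaussian_anchor`, `anchor_attention_loss`, and the `sigma` parameter are hypothetical.

```python
import numpy as np


def gaussian_anchor(grid_h, grid_w, center_xy, sigma=1.5):
    """Build a normalized target attention map over a patch grid.

    ASSUMPTION: center_xy is the end-effector position already projected
    into patch coordinates (x = column, y = row). The Gaussian shape and
    sigma are illustrative choices, not taken from the paper.
    """
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    cx, cy = center_xy
    sq_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    target = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return target / target.sum()


def softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def anchor_attention_loss(attn_logits, target, eps=1e-12):
    """KL(target || softmax(attn_logits)) summed over all patches.

    ASSUMPTION: KL divergence is one plausible way to supervise attention;
    the actual loss used by PosA-VLA is not stated in this summary.
    """
    attn = softmax(attn_logits.ravel())
    t = target.ravel()
    return float(np.sum(t * (np.log(t + eps) - np.log(attn + eps))))


# Toy comparison on an 8x8 patch grid with the anchor at column 5, row 2.
target = gaussian_anchor(8, 8, center_xy=(5.0, 2.0))
uniform_logits = np.zeros((8, 8))        # spatially uniform attention
focused_logits = np.log(target + 1e-12)  # attention already on the anchor

loss_uniform = anchor_attention_loss(uniform_logits, target)
loss_focused = anchor_attention_loss(focused_logits, target)
```

In this toy setup the spatially uniform attention map incurs a larger loss than the one concentrated around the anchor. That is the qualitative behavior the abstract describes: supervision that steers perception away from a uniform field and toward task-relevant regions.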

🔎 Similar Papers

Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models