Action with Visual Primitives

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and limited generalization of existing vision-language-action (VLA) models, which couple instruction understanding, scene perception, and motor control in a single learning objective and therefore relearn capabilities already present in pretrained vision-language models (VLMs). To overcome this, the authors propose the Action with Visual Primitives (AVP) architecture, which introduces a visual-primitive-centric intermediate interface. In AVP, a VLM interprets the high-level task instruction and generates visual-primitive tokens that condition a flow-matching action expert, which produces the low-level motor commands; semantic reasoning is thereby decoupled from action execution. The approach improves data efficiency, spatial-compositional generalization, and cross-object transfer. On real-world robotic pick-and-place tasks, AVP improves the success rate by 27.61% over pi_0.5.

📝 Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements a visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 27.61% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.
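
To make the idea of a primitive-conditioned flow-matching action expert concrete, the toy sketch below may help. It is not the AVP implementation. It assumes the visual primitive is simply a 2D target point and replaces the learned velocity network with the closed-form conditional velocity of a straight-line path, v(x, t) = (target - x) / (1 - t). The names `velocity`, `sample_action`, and `primitive` are hypothetical and chosen only for illustration.

```python
import random


def velocity(x, t, target):
    """Closed-form velocity of the straight-line path from noise to target.

    Assumption: a real action expert would use a learned network conditioned
    on visual-primitive tokens; this analytic field stands in for it.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(primitive, steps=10, seed=0):
    """Integrate the velocity field from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same dimension as the primitive.
    x = [rng.gauss(0.0, 1.0) for _ in primitive]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches zero
        v = velocity(x, t, primitive)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical primitive: a 2D target point, e.g. a normalized image location.
primitive = [0.42, -0.17]
action = sample_action(primitive)
print(action)
```

In this toy setting the final Euler step lands on the conditioning target regardless of the initial noise, because the straight-line field points directly at it. A learned expert would instead map the primitive tokens to a full motor command.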

Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action

robotic manipulation

visual primitives

generalization

learning efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Primitives

Vision-Language-Action

Flow Matching