🤖 AI Summary
To address the inefficiency in training and deployment of Vision-Language-Action (VLA) models caused by costly visual tokenization, this paper proposes Oat-VLA, the first VLA tokenization framework incorporating object-agent-centric representation. Oat-VLA integrates a pre-trained vision-language model with lightweight object detection and agent pose estimation modules to retain only visual tokens from task-relevant regions: key objects and the agent itself. This reduces token count by ~60-80% while preserving or improving task performance. On the LIBERO benchmark, Oat-VLA converges over twice as fast as OpenVLA; it also achieves superior generalization in real-world robotic pick-and-place tasks. The core contribution is the novel integration of joint object-agent-centric modeling into the VLA visual tokenization paradigm, effectively balancing computational efficiency and downstream performance.
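The token-selection idea described above can be sketched as a simple filtering step over ViT patch tokens: keep only the patches whose receptive fields overlap a task-relevant bounding box (detected objects or the agent's end-effector), and discard the rest. This is a minimal illustration, not the paper's implementation; the function name, the box inputs, and the grid/patch sizes (a 224-pixel image with 14-pixel patches) are assumptions for the sketch.

```python
import numpy as np

def select_task_relevant_tokens(patch_tokens, boxes, grid=16, patch=14):
    """Keep only ViT patch tokens overlapping task-relevant regions.

    patch_tokens: (grid*grid, d) array of patch embeddings in row-major order
    boxes: list of (x0, y0, x1, y1) pixel boxes, hypothetically produced by
           the object-detection and agent-pose modules
    Returns the retained tokens and the boolean keep-mask.
    """
    keep = np.zeros(grid * grid, dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # Convert the pixel box to a range of patch-grid rows/columns.
        c0, c1 = int(x0 // patch), int(np.ceil(x1 / patch))
        r0, r1 = int(y0 // patch), int(np.ceil(y1 / patch))
        for r in range(max(r0, 0), min(r1, grid)):
            for c in range(max(c0, 0), min(c1, grid)):
                keep[r * grid + c] = True
    return patch_tokens[keep], keep
```

For a single 28x28-pixel box in the top-left corner, this keeps a 2x2 block of patches, i.e. 4 of the 256 tokens, illustrating how a few relevant regions can shrink the visual token count dramatically.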
📄 Abstract
Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs to robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on insights from object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent's own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a handful without sacrificing performance. We show that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite, and outperforms OpenVLA on diverse real-world pick-and-place tasks.