IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

📅 2026-01-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing vision-language-action (VLA) models, which flatten image patches into one-dimensional token sequences and thereby discard crucial two-dimensional spatial structure. The authors propose a lightweight, training-free inference-time intervention that dynamically realigns visual-token relationships within the language model at the layer where instance-level features reside, leveraging the built-in affinity cues from the visual encoder, without modifying model parameters or adding external modules. This approach enhances spatial awareness while maintaining cross-architecture compatibility across diverse VLA models and tasks. It achieves consistent performance gains on both 2D/3D simulated and real-world robotic benchmarks, including a 4.2% absolute improvement in success rate on low-data VIMA and an increase from 96.3% to 97.1% on the high-precision LIBERO benchmark.

Technology Category

Application Category

๐Ÿ“ Abstract
Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA improves average success by +4.2% over the baseline LLaRA in a low-data regime. On 3D LIBERO, it yields consistent gains over the OpenVLA and FLOWER baselines, including improvements when baseline accuracy is near saturation (96.3% to 97.1%). All code and models will be released publicly. Visualizations are available at: jongwoopark7978.github.io/IVRA
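The core idea described above (reusing the vision encoder's own patch-to-patch affinities to realign visual-token attention at one language-model layer, with all parameters frozen) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `ivra_affinity` and `realign_attention`, the dot-product affinity, and the blending weight `alpha` are all assumptions for demonstration.

```python
import numpy as np

def ivra_affinity(vision_feats, temperature=None):
    """Patch-to-patch affinity from the built-in vision encoder's features.

    vision_feats: (N, d) array of visual patch embeddings.
    Hypothetical formulation: scaled dot-product similarity, row-softmaxed.
    """
    d = vision_feats.shape[1]
    t = temperature or np.sqrt(d)
    logits = vision_feats @ vision_feats.T / t
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)       # each row sums to 1

def realign_attention(attn, affinity, alpha=0.3):
    """Inference-time intervention at the chosen LM layer.

    attn: (N, N) attention weights among the visual tokens at that layer.
    Training-free: no parameter is modified; only the attention map over
    visual tokens is blended with the encoder affinity and renormalized.
    """
    blended = (1.0 - alpha) * attn + alpha * affinity
    return blended / blended.sum(axis=-1, keepdims=True)
```

In practice such an intervention would typically be applied via a forward hook on the selected layer, with `alpha` and the layer index chosen per architecture; both are treated here as free hyperparameters.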
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
spatial cues
visual-token relations
robot action policy
geometric structure
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
visual-token relations
spatial reasoning
vision-language-action
affinity guidance