TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models

📅 2026-03-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) policies often suffer from grasping errors in cluttered scenes due to misidentification of the target object, leading to near-misses or unintended grasps. To address this, the paper proposes Target-Agnostic Guidance (TAG), an inference-time mechanism that compares the original observation with a version in which the target object is erased, and uses the resulting residual as a guidance signal that reinforces the policy's reliance on genuine target evidence. TAG requires no architectural modifications and is compatible with existing VLA policies, leveraging classifier-free guidance (CFG) principles for object erasure and signal generation. Evaluated on the LIBERO, LIBERO-Plus, and VLABench benchmarks, TAG substantially improves robustness to visual distractors and mitigates both near-miss and wrong-object grasping behaviors.

πŸ“ Abstract
Vision-Language-Action (VLA) policies have shown strong progress in mapping language instructions and visual observations to robotic actions, yet their reliability degrades in cluttered scenes with distractors. By analyzing failure cases, we find that many errors do not arise from infeasible motions, but from instance-level grounding failures: the policy often produces a plausible grasp trajectory that lands slightly off-target or even on the wrong object instance. To address this issue, we propose TAG (Target-Agnostic Guidance), a simple inference-time guidance mechanism that explicitly reduces distractor- and appearance-induced bias in VLA policies. Inspired by classifier-free guidance (CFG), TAG contrasts policy predictions under the original observation and an object-erased observation, and uses their difference as a residual steering signal that strengthens the influence of object evidence in the decision process. TAG does not require modifying the policy architecture and can be integrated with existing VLA policies with minimal training and inference changes. We evaluate TAG on standard manipulation benchmarks, including LIBERO, LIBERO-Plus, and VLABench, where it consistently improves robustness under clutter and reduces near-miss and wrong-object executions.
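The CFG-style contrast described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`tag_guided_action`), the linear toy policy, and the guidance scale are all assumptions made for the example; the paper's erasure procedure and policy are far richer.

```python
import numpy as np

def tag_guided_action(policy, obs, obs_erased, scale=1.5):
    """CFG-style residual guidance (sketch).

    The difference between the prediction on the full observation and
    on the target-erased observation isolates the part of the action
    attributable to target-object evidence; scaling that residual
    strengthens its influence on the final action.
    """
    a_full = policy(obs)           # prediction with the target visible
    a_erased = policy(obs_erased)  # prediction with the target erased
    return a_erased + scale * (a_full - a_erased)

# Toy demo: a linear "policy" on an 8-dim observation, where the
# (hypothetical) target object occupies the first 4 dimensions.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))
policy = lambda o: W @ o

obs = rng.standard_normal(8)
obs_erased = obs.copy()
obs_erased[:4] = 0.0  # crude stand-in for object erasure

action = tag_guided_action(policy, obs, obs_erased, scale=1.5)
```

With `scale=1.0` the formula reduces to the unguided prediction `policy(obs)`, mirroring how CFG recovers the conditional model at guidance weight 1; values above 1 push the action further along the target-evidence direction.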
Problem

Research questions and friction points this paper is trying to address.

object grounding
vision-language-action models
cluttered scenes
distractor bias
instance-level errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Target-Agnostic Guidance
Vision-Language-Action Models
Object-Centric Inference
Classifier-Free Guidance
Robotic Manipulation
Jiaying Zhou
Sun Yat-sen University
Zhihao Zhan
TopXGun Robotics (SLAM, Spatial AI, Robotics)
Ruifeng Zhai
Sun Yat-sen University
Qinhan Lyu
Sun Yat-sen University
Hao Liu
Sun Yat-sen University
Keze Wang
Sun Yat-sen University, Guangdong Key Lab of Big Data Analysis & Processing, X-Era AI Lab
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University (Embodied AI, Causal Inference and Learning, Multimodal Data Analysis)
Guangrun Wang
University of Oxford; AI Research Team at Aistetic (Machine Learning, General Intelligence Theory and Application)